# Which project proposal is going to be approved? DonorsChoose Data Set Analysis

Welcome to my DonorsChoose exploratory data analysis notebook!

This work aims to help answer the question above, "which project proposal is going to be approved?", which is the main point of the contest.

To reach good answers, this notebook presents some visualizations and insights about the data.

Enjoy!

## Libraries

In [None]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

train = pd.read_csv('../input/train.csv', parse_dates=['project_submitted_datetime'])
test = pd.read_csv('../input/test.csv', parse_dates=['project_submitted_datetime'])
resources = pd.read_csv('../input/resources.csv')
sample_sub = pd.read_csv('../input/sample_submission.csv')

## Data First Look

Let's take a look at the structure of our data sets to see what exploratory analysis is possible.

The data sets are composed mostly of textual features, plus some numerical ones. This means that a lot of pre-processing is needed before fitting the data with a "regular classifier", e.g., a tree-based model or a logistic regression, which depends on a tabular data format.

Such pre-processing can range from label encoding to Term Frequency–Inverse Document Frequency (TF-IDF).
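To make the two ends of that range concrete, here is a minimal sketch on made-up data. The column names mirror the contest files, but the values, the `tfidf` helper and the plain `tf * log(N / df)` weighting are my own assumptions for illustration (library implementations usually add smoothing and normalization).

```python
import math
from collections import Counter

import pandas as pd

# Toy data: column names follow the contest files, values are invented.
toy = pd.DataFrame({
    'school_state': ['CA', 'NY', 'CA'],
    'project_title': ['math books', 'art supplies', 'math games'],
})

# Label encoding: map each distinct category to an integer code.
toy['school_state_code'], state_labels = pd.factorize(toy['school_state'])

# Plain TF-IDF: term frequency in a document times log(N / document frequency).
docs = [title.split() for title in toy['project_title']]
doc_freq = Counter(term for doc in docs for term in set(doc))

def tfidf(doc):
    """Return {term: weight} for one tokenized document (hypothetical helper)."""
    counts = Counter(doc)
    return {term: (n / len(doc)) * math.log(len(docs) / doc_freq[term])
            for term, n in counts.items()}

toy_tfidf = [tfidf(doc) for doc in docs]
```

In the first title, "math" gets a lower weight than "books" because it also appears in another title.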

The **target variable** is represented by the feature "**project_is_approved**".

In [None]:
print('Train shape: {}\nTest shape: {}\nResources shape: {}'.format(train.shape, test.shape, resources.shape))
print('Approved projects: {}%'.format(round(train['project_is_approved'].mean()*100, 2)))

In [None]:
print('Train set - First 5 rows')
train.head()

In [None]:
print('Resources - First 5 rows')
resources.head()

## Merging Data
The Train and Test sets have an attribute in common with the Resources set: the "id" column. Let's merge the data so that we have fewer files to work with.

In [None]:
train = pd.merge(train, resources, on='id', how='left', sort=False)
test = pd.merge(test, resources, on='id', how='left', sort=False)
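One thing to keep in mind about a left merge on "id": if the resources file holds several rows per project (as it appears to), the merged frame has one row per requested resource, not one per project. The sketch below shows this on made-up data, next to an alternative that collapses resources to one row per project first. The names `toy_train`, `toy_resources`, `cost`, `total_cost` and `n_items` are hypothetical, and this is not the approach used in the rest of this notebook.

```python
import pandas as pd

# Toy stand-ins for train.csv and resources.csv (values are invented).
toy_train = pd.DataFrame({'id': ['p1', 'p2'], 'project_is_approved': [1, 0]})
toy_resources = pd.DataFrame({
    'id': ['p1', 'p1', 'p2'],
    'quantity': [2, 1, 3],
    'price': [10.0, 5.0, 4.0],
})

# A plain left merge repeats each project once per requested resource.
row_level = pd.merge(toy_train, toy_resources, on='id', how='left')

# Alternative: aggregate resources to one row per project before merging.
toy_resources['cost'] = toy_resources['price'] * toy_resources['quantity']
per_project = (toy_resources.groupby('id')
               .agg(total_cost=('cost', 'sum'), n_items=('quantity', 'sum'))
               .reset_index())
project_level = pd.merge(toy_train, per_project, on='id', how='left')
```

With the row-level merge, counts and percentages are computed over resource rows; with the project-level merge they are computed over projects.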

## Visualizations
Proportion of approved and not approved projects in the training data set; this is the contest target.


In [None]:
def absolute_value(val):
    return str(np.round(val,2)) + '%'

plt.figure(figsize=(8,8))
plt.pie(train['project_is_approved'].value_counts(), explode = (0.1, 0),
        labels=['Approved Projects', 'Not Approved Projects'], autopct=absolute_value, shadow=True)
plt.title('Target "project_is_approved" distribution over training data', fontsize=12)
plt.legend(fontsize=12)
plt.show()

Project proposals may request ONE or MORE items, as recorded in the "quantity" column. To evaluate the "price" column, "quantity" will be considered as well.

The histograms further below show proposal prices considering just one item per project and then all items per project.

In [None]:
print('--- Price description ---')
train['price'].describe()

In [None]:
plt.figure(figsize=(12,8))

plt.subplot(121)
p = plt.pie((train['quantity']>1).value_counts(), explode = (0.1, 0),
        labels=['ONE Item', 'More than ONE Item'], autopct=absolute_value, shadow=True, startangle=-150)
p = plt.title('Feature "quantity" distribution over TRAINING data', fontsize=12)
p = plt.legend(fontsize=12)

plt.subplot(122)
p = plt.pie((test['quantity']>1).value_counts(), explode = (0.1, 0),
        labels=['ONE Item', 'More than ONE Item'], autopct=absolute_value, shadow=True, startangle=-150)
p = plt.title('Feature "quantity" distribution over TESTING data', fontsize=12)
p = plt.legend(fontsize=12)

plt.show()

More than 33% of project proposals have at least 2 items in the request, and the training and testing distributions are very similar. So it is important to consider multiple items when visualizing the data and creating features.
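The same comparison can be made numerically instead of reading it off two pies. A small sketch on made-up data; `multi_item_share` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

# Invented quantities standing in for the merged train and test frames.
toy_train = pd.DataFrame({'quantity': [1, 1, 2, 5, 1, 1]})
toy_test = pd.DataFrame({'quantity': [1, 3, 1]})

def multi_item_share(df):
    """Percentage of rows whose quantity is greater than one."""
    return (df['quantity'] > 1).mean() * 100

train_share = multi_item_share(toy_train)
test_share = multi_item_share(toy_test)
share_gap = abs(train_share - test_share)  # small gap = similar distributions
```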

In [None]:
plt.figure(figsize=(12,10))

plt.subplot(211)
p = sns.distplot(np.log1p(train['price']), bins=100, label='Price (ONE item)')
p = plt.title('Projects\' price (just one \'item\')',fontsize=12)
p = plt.ylim(0, 0.5)
p = plt.xlim(0,10)
p = plt.legend(fontsize=12)
p = plt.ylabel('Normalized Frequency', fontsize=12)
p = plt.xlabel('Log1p price', fontsize=12)

plt.subplot(212)
p1 = sns.distplot(np.log1p(train['price'] * train['quantity']), bins=100, label='Price (ALL items)')
p1 = plt.title('Projects\' price (considering \'quantity\')',fontsize=12)
p1 = plt.ylim(0, 0.5)
p1 = plt.xlim(0,10)
p1 = plt.legend(fontsize=12)
p1 = plt.ylabel('Normalized Frequency', fontsize=12)
p1 = plt.xlabel('Log1p price', fontsize=12)

plt.show()

Looking at these histograms, with or without the item quantity, we can see that a lot of projects don't really have a price: this is the isolated first bar in both plots.

It is also clear that there are fewer really "expensive" projects than cheaper ones, because the right tail of the density line is longer than the left one.
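Both observations can be checked with numbers as well. This is only a sketch on invented prices: the share of zero prices corresponds to the first bar, and a positive skewness corresponds to a longer right tail.

```python
import numpy as np
import pandas as pd

# Made-up prices: many cheap items and one very expensive one.
toy_price = pd.Series([0.0, 3.5, 8.0, 12.0, 15.0, 20.0, 45.0, 900.0])

zero_price_share = (toy_price == 0).mean()  # fraction of rows with no price
raw_skew = toy_price.skew()                 # > 0 means a longer right tail
log_skew = np.log1p(toy_price).skew()       # skewness after the log1p transform
```

On this toy series the log1p transform reduces the skewness, which is the reason the plots use it.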

Let's look at the density plots alone, to compare them better.

In [None]:
plt.figure(figsize=(12,5))

sns.distplot(np.log1p(train['price']), bins=100, label='Price (ONE item)', hist=False)
sns.distplot(np.log1p(train['price'] * train['quantity']), bins=100, label='Price (ALL items)', hist=False)
plt.title('Projects\' price (just one \'item\' and considering \'quantity\')',fontsize=12)
plt.legend(fontsize=12)
plt.ylim(0, 0.5)
plt.ylabel('Normalized Frequency', fontsize=12)
plt.xlabel('Log1p price', fontsize=12)

plt.show()

The "ALL items price" density plot is a little smoother than the plot representing the price of just one item in the project proposal.
The "ALL items price" plot is more faithful for analysing the data, because it represents the real cost of each project.
Seen this way, the plot shows fewer "cheap" projects than one might think at first glance.

## Project approved/not approved comparison
So far we have looked at project prices without distinction. Now it is time to consider the target column when plotting.

In [None]:
plt.figure(figsize=(12,6))

plt.subplot(121)
p = sns.distplot(np.log1p(train[train['project_is_approved']==1]['price']), bins=75, label='Approved Projects (ONE)')
p = sns.distplot(np.log1p(train[train['project_is_approved']==0]['price']), bins=75, label='NOT Approved Projects (ONE)')
p = plt.xlim(0,10)
p = plt.ylim(0,0.5)
p = plt.title('Projects\' price ONE item',fontsize=12)
p = plt.legend(fontsize=12)
p = plt.ylabel('Normalized Frequency', fontsize=12)
p = plt.xlabel('Log1p price', fontsize=12)

plt.subplot(122)
p1 = sns.distplot(np.log1p(train[train['project_is_approved']==1]['price'] * train[train['project_is_approved']==1]['quantity']), bins=75, label='Approved Projects (ALL)')
p1 = sns.distplot(np.log1p(train[train['project_is_approved']==0]['price'] * train[train['project_is_approved']==0]['quantity']), bins=75, label='NOT Approved Projects (ALL)')
p1 = plt.xlim(0,10)
p1 = plt.ylim(0,0.5)
p1 = plt.title('Projects\' price considering ALL items',fontsize=12)
p1 = plt.legend(fontsize=12)
p1 = plt.ylabel('Normalized Frequency', fontsize=12)
p1 = plt.xlabel('Log1p price', fontsize=12)

plt.show()

Judging by these side-by-side histograms, higher costs do not seem to lead projects to be rejected.

Maybe other transformations can be applied to the "price" feature.
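As a sketch of what such transformations could look like, here are two candidates on made-up data: a square root and quantile bins. Both are my own suggestions, not something evaluated in this notebook, and the column names `total_cost`, `sqrt_cost` and `cost_bin` are hypothetical.

```python
import numpy as np
import pandas as pd

# Invented prices and quantities.
toy = pd.DataFrame({'price': [2.0, 15.0, 40.0, 120.0, 300.0, 999.0],
                    'quantity': [10, 2, 1, 1, 3, 1]})
toy['total_cost'] = toy['price'] * toy['quantity']

# Square root: compresses large values less aggressively than log1p.
toy['sqrt_cost'] = np.sqrt(toy['total_cost'])

# Quantile bins: three equally populated buckets labelled 0, 1, 2.
toy['cost_bin'] = pd.qcut(toy['total_cost'], q=3, labels=False)
```

Quantile bins discard the exact amount and keep only the rank, which makes them insensitive to a few extreme prices.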

## Project amount and prices by State
Since we know the state each project proposal came from, let's analyse the price distribution of each place.

In [None]:
plt.figure(figsize=(12,18))

train['log1p_price'] = np.log1p(train['price'])
train['log1p_price_x_quantity'] = np.log1p(train['price'] * train['quantity'])

plt.subplot(311)
gb_train_count = train.groupby(['school_state']).count().reset_index()
gb_train_count.sort_values(by='project_is_approved', inplace=True, ascending=False)
p = sns.barplot(x=gb_train_count['school_state'], y=gb_train_count['project_is_approved'])
p = plt.title('Number of project proposals by school state',fontsize=12)
p = plt.ylabel('Total number of projects proposals', fontsize=12)
p = plt.xlabel('School State', fontsize=12)

plt.subplot(312)
order = train.groupby(['school_state'])['project_is_approved'].count().sort_values()[::-1].index
p = sns.boxplot(x='school_state', y='log1p_price', data=train, order=order)
p = plt.title('Price boxplots of Projects\' school state (ONE item)',fontsize=12)
p = plt.ylabel('Log1p price', fontsize=12)
p = plt.xlabel('School State', fontsize=12)

plt.subplot(313)
order = train.groupby(['school_state'])['project_is_approved'].count().sort_values()[::-1].index
p1 = sns.boxplot(x='school_state', y='log1p_price_x_quantity', data=train, order=order)
p1 = plt.title('Price boxplots of Projects\' school state (ALL items)',fontsize=12)
p1 = plt.ylabel('Log1p price', fontsize=12)
p1 = plt.xlabel('School State', fontsize=12)

plt.show()

Both boxplots above are sorted by project count for easier reading. Both show little variation. The third plot, which uses the quantity information, is more regular and has fewer outliers than the second. Still, machine learning algorithms may also benefit from the variation in the second boxplot.

Having more project proposals does not necessarily lead to more approved projects.
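One way to put proposal volume and approval side by side is a per-state table with both the row count and the approval rate. A minimal sketch on made-up data; `n_rows` and `approval_rate` are names chosen here for illustration.

```python
import pandas as pd

# Invented rows: CA submits the most, but is not the most successful.
toy = pd.DataFrame({
    'school_state': ['CA', 'CA', 'CA', 'CA', 'NY', 'NY', 'WY'],
    'project_is_approved': [1, 0, 1, 0, 1, 1, 1],
})

# Count of rows and mean of the 0/1 target (the approval rate) per state.
by_state = (toy.groupby('school_state')['project_is_approved']
            .agg(n_rows='count', approval_rate='mean')
            .sort_values('n_rows', ascending=False))
```

Sorting by `n_rows` keeps the same order as the barplot, so the approval rate can be read against volume directly.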

## Project price by State - Approved/Not Approved comparison

In [None]:
plt.figure(figsize=(12,20))

train['log1p_price'] = np.log1p(train['price'])
train['log1p_price_x_quantity'] = np.log1p(train['price'] * train['quantity'])

plt.subplot(121)
order = train.groupby(['school_state'])['log1p_price'].median().fillna(0).sort_values()[::-1].index
p = sns.boxplot(y='school_state', x='log1p_price', hue='project_is_approved', data=train, order=order, orient='h')
p = plt.title('Price boxplots of Projects\' school state (ONE item)',fontsize=12)
p = plt.xlabel('Log1p price', fontsize=12)
p = plt.ylabel('School State', fontsize=12)

plt.subplot(122)
order = train.groupby(['school_state'])['log1p_price_x_quantity'].median().fillna(0).sort_values()[::-1].index
p1 = sns.boxplot(y='school_state', x='log1p_price_x_quantity', hue='project_is_approved', data=train, order=order, orient='h')
p1 = plt.title('Price boxplots of Projects\' school state (ALL items)',fontsize=12)
p1 = plt.xlabel('Log1p price', fontsize=12)
p1 = plt.ylabel('School State', fontsize=12)

plt.show()

These charts are intended to show how the price distribution differs by state between approved and not approved projects. Most boxplots behave similarly, but it is worth noting that some have other shapes, for example in states such as ND, WY, VT, NM and MN, where the third quartile of the NOT APPROVED boxplot is higher than that of the corresponding APPROVED boxplot.
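The third-quartile comparison can also be listed rather than read from the chart. The sketch below uses made-up values (the state codes are only placeholders) and the `log1p_price_x_quantity` column name from the cell above.

```python
import pandas as pd

# Invented values: in 'ND' rejected rows cost more, in 'CA' the opposite.
toy = pd.DataFrame({
    'school_state': ['ND'] * 4 + ['CA'] * 4,
    'project_is_approved': [1, 1, 0, 0, 1, 1, 0, 0],
    'log1p_price_x_quantity': [3.0, 4.0, 5.0, 7.0, 5.0, 6.0, 3.0, 4.0],
})

# Third quartile per state and approval status, one column per status (0 / 1).
q3 = (toy.groupby(['school_state', 'project_is_approved'])['log1p_price_x_quantity']
      .quantile(0.75)
      .unstack('project_is_approved'))

# States where the NOT approved third quartile exceeds the approved one.
rejected_higher = q3[q3[0] > q3[1]].index.tolist()
```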

## Project price by teacher's prefix

In [None]:
plt.figure(figsize=(12,12))

train['log1p_price_x_quantity'] = np.log1p(train['price'] * train['quantity'])

plt.subplot(211)
order = train.groupby(['teacher_prefix'])['log1p_price_x_quantity'].median().fillna(0).sort_values()[::-1].index
p1 = sns.boxplot(x='teacher_prefix', y='log1p_price_x_quantity', data=train, order=order)
p1 = plt.title('Project price boxplots by teacher\'s prefix (ALL items)',fontsize=12)
p1 = plt.ylabel('Log1p price', fontsize=12)
p1 = plt.xlabel('Teacher prefix', fontsize=12)

plt.subplot(212)
order = train.groupby(['teacher_prefix'])['log1p_price_x_quantity'].median().fillna(0).sort_values()[::-1].index
p1 = sns.boxplot(x='teacher_prefix', y='log1p_price_x_quantity', hue='project_is_approved', data=train, order=order)
p1 = plt.title('Project price boxplots by teacher\'s prefix and approval (ALL items)',fontsize=12)
p1 = plt.ylabel('Log1p price', fontsize=12)
p1 = plt.xlabel('Teacher prefix', fontsize=12)

plt.show()

From these boxplots we can conclude that teachers with the "Dr." prefix have, in median, the most expensive project proposals. Likewise, they have the highest success rate, in median.

Let's look at the actual percentages now.

In [None]:
plt.figure(figsize=(12,8))

# Count rows per prefix once, so that the slice labels come from the same
# (alphabetically sorted) index as the values instead of from unique() order
approved = train[train['project_is_approved']==1].groupby('teacher_prefix')['project_is_approved'].count()
not_approved = train[train['project_is_approved']==0].groupby('teacher_prefix')['project_is_approved'].count()

plt.subplot(121)
# explode assumes five prefixes and pulls out the last slice
p = plt.pie(approved, explode = (0,0,0,0,0.1), labels=approved.index, autopct=absolute_value)
p = plt.title('Approved projects percentage by teacher\'s prefix over training data', fontsize=12)
p = plt.legend(fontsize=12)

plt.subplot(122)
p1 = plt.pie(not_approved, explode = (0,0,0,0,0.1), labels=not_approved.index, autopct=absolute_value)
p1 = plt.title('NOT Approved projects percentage by teacher\'s prefix over training data', fontsize=12)
p1 = plt.legend(fontsize=12)
plt.show()

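Even though Drs. do have a higher acceptance percentage, they do not represent a large share of all teachers. Just 1.73% of all approved projects are theirs, as are 2.65% of rejected projects. Note that a larger share among rejected than among approved projects points to a below-average approval rate for that slice, so the "higher acceptance" reading deserves a second check. In any case, features related to Drs. will probably have a small impact on precision metrics.

Note that I'm not saying you shouldn't use them!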
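The approval rate per prefix and the share of rows each prefix accounts for can be computed directly, which avoids comparing two pies by eye. A sketch on made-up data; `n_rows`, `approval_rate` and `share_pct` are hypothetical names.

```python
import pandas as pd

# Invented rows; the missing prefix is dropped by groupby by default.
toy = pd.DataFrame({
    'teacher_prefix': ['Mrs.', 'Mrs.', 'Ms.', 'Ms.', 'Mr.', 'Dr.', None],
    'project_is_approved': [1, 1, 1, 0, 0, 1, 1],
})

prefix_stats = (toy.groupby('teacher_prefix')['project_is_approved']
                .agg(n_rows='count', approval_rate='mean'))

# Share of all (non-missing) rows that each prefix accounts for, in percent.
prefix_stats['share_pct'] = 100 * prefix_stats['n_rows'] / prefix_stats['n_rows'].sum()
```

A prefix with a high `approval_rate` but a tiny `share_pct` is exactly the case discussed above: informative, but affecting few rows.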

## Approved/NOT approved projects historical view
The data set provides precise date information for every project proposal submission. Let's evaluate projects by the time they were proposed.

In [None]:
plt.figure(figsize=(12,15))

train['datetime_no_seconds'] = train['project_submitted_datetime'].dt.date
ts_train = train[['datetime_no_seconds','project_is_approved']].groupby('datetime_no_seconds').count()
test['datetime_no_seconds'] = test['project_submitted_datetime'].dt.date
ts_test = test[['datetime_no_seconds', 'teacher_id']].groupby('datetime_no_seconds').count()

plt.subplot(311)
p = plt.plot(ts_train, label='train', linewidth=2)
p = plt.plot(ts_test, label='test', linewidth=2)
p = plt.ylim(0,15000)
p = plt.title('Total project proposals over time',fontsize=12)
p = plt.ylabel('Total Project proposals', fontsize=12)
p = plt.xlabel('Date', fontsize=12)
p = plt.legend(fontsize=12)


plt.subplot(312)
ts_train_sum = train[train['project_is_approved']==1][['datetime_no_seconds','project_is_approved']].groupby('datetime_no_seconds').count()
p1 = plt.plot(ts_train_sum, label='Approved projects', linewidth=2)
ts_train_sum = train[train['project_is_approved']==0][['datetime_no_seconds','project_is_approved']].groupby('datetime_no_seconds').count()
p1 = plt.plot(ts_train_sum, label='NOT Approved projects', linewidth=2)
p1 = plt.ylim(0,15000)
p1 = plt.title('Approved and NOT approved projects over time',fontsize=12)
p1 = plt.ylabel('Total Project proposals', fontsize=12)
p1 = plt.xlabel('Date', fontsize=12)
p1 = plt.legend(fontsize=12)

plt.subplot(313)
ts_train_sum = train[train['project_is_approved']==1][['datetime_no_seconds','log1p_price_x_quantity']].groupby('datetime_no_seconds').mean()
p2 = plt.plot(ts_train_sum, label='Approved projects', linewidth=2)
ts_train_sum = train[train['project_is_approved']==0][['datetime_no_seconds','log1p_price_x_quantity']].groupby('datetime_no_seconds').mean()
p2 = plt.plot(ts_train_sum, label='NOT Approved projects', linewidth=2)
p2 = plt.title('Approved and NOT approved projects\' price over time',fontsize=12)
p2 = plt.ylabel('Mean log1p prices', fontsize=12)
p2 = plt.xlabel('Date', fontsize=12)
p2 = plt.legend(fontsize=12)

plt.show()

1. First plot: the train and test sets show a very similar distribution of submitted projects over time. As the test set has less than half the length of the train set, its curve is slightly shifted from the train curve. Both have high-submission periods at the same time.
2. Second plot: there are no big differences between the "approved" and "not approved" curves, except for the number of projects. Both lines follow the same trends.
3. Third plot: periods in which expensive projects were approved and periods in which they were not approved alternate over time. Good features can be extracted from this information, but I rather doubt that project price is related to project acceptance (compare plots 2 and 3).
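As an example of features that could be extracted from the submission date, here is a minimal sketch on made-up timestamps. The `submit_*` column names are hypothetical, and whether any of them helps a model is untested here.

```python
import pandas as pd

# Invented timestamps in the same column as the contest data.
toy = pd.DataFrame({
    'project_submitted_datetime': pd.to_datetime(
        ['2016-08-15 09:30:00', '2016-09-01 18:05:00', '2017-01-07 23:59:00']),
})

dt = toy['project_submitted_datetime'].dt
toy['submit_month'] = dt.month
toy['submit_dayofweek'] = dt.dayofweek       # Monday=0 ... Sunday=6
toy['submit_hour'] = dt.hour
toy['submit_is_weekend'] = dt.dayofweek >= 5  # Saturday or Sunday
```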

## To be continued..
    