# DonorsChoose

<p>
DonorsChoose.org receives hundreds of thousands of project proposals each year for classroom projects in need of funding. Right now, a large number of volunteers is needed to manually screen each submission before it's approved to be posted on the DonorsChoose.org website.
</p>
<p>
    Next year, DonorsChoose.org expects to receive close to 500,000 project proposals. As a result, there are three main problems they need to solve:
<ul>
<li>
    How to scale current manual processes and resources to screen 500,000 projects so that they can be posted as quickly and as efficiently as possible</li>
    <li>How to increase the consistency of project vetting across different volunteers to improve the experience for teachers</li>
    <li>How to focus volunteer time on the applications that need the most assistance</li>
    </ul>
</p>    
<p>
The goal of the competition is to predict whether or not a DonorsChoose.org project proposal submitted by a teacher will be approved, using the text of project descriptions as well as additional metadata about the project, teacher, and school. DonorsChoose.org can then use this information to identify projects most likely to need further review before approval.
</p>

## About the DonorsChoose Data Set

The `train.csv` data set provided by DonorsChoose contains the following features:

Feature | Description 
----------|---------------
**`project_id`** | A unique identifier for the proposed project. **Example:** `p036502`   
**`project_title`**    | Title of the project. **Examples:**<br><ul><li><code>Art Will Make You Happy!</code></li><li><code>First Grade Fun</code></li></ul> 
**`project_grade_category`** | Grade level of students for which the project is targeted. One of the following enumerated values: <br/><ul><li><code>Grades PreK-2</code></li><li><code>Grades 3-5</code></li><li><code>Grades 6-8</code></li><li><code>Grades 9-12</code></li></ul>  
 **`project_subject_categories`** | One or more (comma-separated) subject categories for the project from the following enumerated list of values:  <br/><ul><li><code>Applied Learning</code></li><li><code>Care &amp; Hunger</code></li><li><code>Health &amp; Sports</code></li><li><code>History &amp; Civics</code></li><li><code>Literacy &amp; Language</code></li><li><code>Math &amp; Science</code></li><li><code>Music &amp; The Arts</code></li><li><code>Special Needs</code></li><li><code>Warmth</code></li></ul><br/> **Examples:** <br/><ul><li><code>Music &amp; The Arts</code></li><li><code>Literacy &amp; Language, Math &amp; Science</code></li></ul>  
  **`school_state`** | State where school is located ([Two-letter U.S. postal code](https://en.wikipedia.org/wiki/List_of_U.S._state_abbreviations#Postal_codes)). **Example:** `WY`
**`project_subject_subcategories`** | One or more (comma-separated) subject subcategories for the project. **Examples:** <br/><ul><li><code>Literacy</code></li><li><code>Literature &amp; Writing, Social Sciences</code></li></ul> 
**`project_resource_summary`** | An explanation of the resources needed for the project. **Example:** <br/><ul><li><code>My students need hands on literacy materials to manage sensory needs!</code></li></ul> 
**`project_essay_1`**    | First application essay<sup>*</sup>  
**`project_essay_2`**    | Second application essay<sup>*</sup> 
**`project_essay_3`**    | Third application essay<sup>*</sup> 
**`project_essay_4`**    | Fourth application essay<sup>*</sup> 
**`project_submitted_datetime`** | Datetime when project application was submitted. **Example:** `2016-04-28 12:43:56.245`   
**`teacher_id`** | A unique identifier for the teacher of the proposed project. **Example:** `bdf8baa8fedef6bfeec7ae4ff1c15c56`  
**`teacher_prefix`** | Teacher's title. One of the following enumerated values: <br/><ul><li><code>nan</code></li><li><code>Dr.</code></li><li><code>Mr.</code></li><li><code>Mrs.</code></li><li><code>Ms.</code></li><li><code>Teacher.</code></li></ul>  
**`teacher_number_of_previously_posted_projects`** | Number of project applications previously submitted by the same teacher. **Example:** `2` 

<sup>*</sup> See the section <b>Notes on the Essay Data</b> for more details about these features.

Additionally, the `resources.csv` data set provides more data about the resources required for each project. Each line in this file represents a resource required by a project:

Feature | Description 
----------|---------------
**`id`** | A `project_id` value from the `train.csv` file.  **Example:** `p036502`   
**`description`** | Description of the resource. **Example:** `Tenor Saxophone Reeds, Box of 25`   
**`quantity`** | Quantity of the resource required. **Example:** `3`   
**`price`** | Price of the resource required. **Example:** `9.95`   

**Note:** Many projects require multiple resources. The `id` value corresponds to a `project_id` in `train.csv`, so you can use it as a key to retrieve all resources needed for a project.
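
The lookup described in the note can be sketched with pandas. This is a minimal illustration on made-up rows (the `Music Stand` row and all prices other than `9.95` are invented for the example), not the real file contents:

```python
import pandas as pd

# Hypothetical miniature of resources.csv; values are illustrative only.
resources = pd.DataFrame({
    "id": ["p036502", "p036502", "p233245"],
    "description": ["Tenor Saxophone Reeds, Box of 25", "Music Stand", "Mobile Drying Rack"],
    "quantity": [3, 1, 1],
    "price": [9.95, 24.50, 149.00],
})

# All resource rows that belong to a single project, using `id` as the key.
project_resources = resources[resources["id"] == "p036502"]
print(project_resources[["description", "quantity", "price"]])
```

Filtering on `id` returns one row per requested resource, which is why a project can appear many times in `resources.csv`.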

The data set contains the following label (the value you will attempt to predict):

Label | Description
----------|---------------
`project_is_approved` | A binary flag indicating whether DonorsChoose approved the project. A value of `0` indicates the project was not approved, and a value of `1` indicates the project was approved.

### Notes on the Essay Data

<ul>
Prior to May 17, 2016, the prompts for the essays were as follows:
<li>__project_essay_1:__ "Introduce us to your classroom"</li>
<li>__project_essay_2:__ "Tell us more about your students"</li>
<li>__project_essay_3:__ "Describe how your students will use the materials you're requesting"</li>
<li>__project_essay_4:__ "Close by sharing why your project will make a difference"</li>
</ul>


<ul>
Starting on May 17, 2016, the number of essays was reduced from 4 to 2, and the prompts for the first 2 essays were changed to the following:<br>
<li>__project_essay_1:__ "Describe your students: What makes your students special? Specific details about their background, your neighborhood, and your school are all helpful."</li>
<li>__project_essay_2:__ "About your project: How will these materials make a difference in your students' learning and improve their school lives?"</li>
<br>For all projects with project_submitted_datetime of 2016-05-17 and later, the values of project_essay_3 and project_essay_4 will be NaN.
</ul>
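
Because `project_essay_3` and `project_essay_4` are empty for later submissions, any code that merges the essays has to handle missing values. The following is a small sketch on invented rows; the essay texts and the `new_format` column name are assumptions made for illustration:

```python
import pandas as pd

# Two made-up submissions: one before and one after the 2016-05-17 change.
essays = pd.DataFrame({
    "project_submitted_datetime": ["2016-04-28 12:43:56", "2016-12-05 13:43:57"],
    "project_essay_1": ["Intro. ", "Students. "],
    "project_essay_2": ["More. ", "Project."],
    "project_essay_3": ["Use. ", None],
    "project_essay_4": ["Close.", None],
})

essay_cols = [f"project_essay_{i}" for i in range(1, 5)]
filled = essays[essay_cols].fillna("")  # missing essays become empty strings

# Concatenate the four essay columns into one text field.
essays["essay"] = (filled["project_essay_1"] + filled["project_essay_2"]
                   + filled["project_essay_3"] + filled["project_essay_4"])

# Flag rows written against the newer two-essay prompts.
submitted = pd.to_datetime(essays["project_submitted_datetime"])
essays["new_format"] = submitted >= pd.Timestamp("2016-05-17")
print(essays[["essay", "new_format"]])
```

Filling with an empty string before concatenating avoids ending up with the literal text `nan` inside the merged essay.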


In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")

import sqlite3
import pandas as pd
import numpy as np
import nltk
import string
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.metrics import roc_curve, auc
from nltk.stem.porter import PorterStemmer
from sklearn.preprocessing import LabelEncoder
import re
# Tutorial about Python regular expressions: https://pymotw.com/2/re/
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer

from gensim.models import Word2Vec
from gensim.models import KeyedVectors
import pickle

from tqdm import tqdm
import os

from chart_studio import plotly
import plotly.offline as offline
import plotly.graph_objs as go
offline.init_notebook_mode()
from collections import Counter
from keras.utils import to_categorical

from tensorflow.keras.callbacks import TensorBoard


In [0]:
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import roc_auc_score
from scipy.sparse import hstack
from numpy import zeros
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Input
from tensorflow.keras.layers import Flatten
from tensorflow.keras.layers import Embedding
from tensorflow.keras.layers import LSTM,Bidirectional
from tensorflow.keras.models import Model, load_model
from tensorflow.keras.layers import BatchNormalization
from tensorflow.keras.callbacks import ReduceLROnPlateau
from tensorflow.keras.layers import Dense


## 1.1 Reading Data

In [0]:
file_train = "/content/drive/My Drive/Colab Notebooks/lstmDonorChoose/train_data.csv"
file_resource = "/content/drive/My Drive/Colab Notebooks/lstmDonorChoose/resources.csv"

In [0]:
project_data = pd.read_csv(file_train)
resource_data = pd.read_csv(file_resource)

In [6]:
print("Number of data points in train data", project_data.shape)
print('-'*50)
print("The attributes of data :", project_data.columns.values)

Number of data points in train data (109248, 17)
--------------------------------------------------
The attributes of data : ['Unnamed: 0' 'id' 'teacher_id' 'teacher_prefix' 'school_state'
 'project_submitted_datetime' 'project_grade_category'
 'project_subject_categories' 'project_subject_subcategories'
 'project_title' 'project_essay_1' 'project_essay_2' 'project_essay_3'
 'project_essay_4' 'project_resource_summary'
 'teacher_number_of_previously_posted_projects' 'project_is_approved']


In [7]:
print("Number of data points in resource data", resource_data.shape)
print(resource_data.columns.values)
resource_data.head(2)

Number of data points in resource data (1541272, 4)
['id' 'description' 'quantity' 'price']


Unnamed: 0,id,description,quantity,price
0,p233245,LC652 - Lakeshore Double-Space Mobile Drying Rack,1,149.0
1,p069063,Bouncy Bands for Desks (Blue support pipes),3,14.95


In [8]:
project_data.isnull().sum()

Unnamed: 0                                           0
id                                                   0
teacher_id                                           0
teacher_prefix                                       3
school_state                                         0
project_submitted_datetime                           0
project_grade_category                               0
project_subject_categories                           0
project_subject_subcategories                        0
project_title                                        0
project_essay_1                                      0
project_essay_2                                      0
project_essay_3                                 105490
project_essay_4                                 105490
project_resource_summary                             0
teacher_number_of_previously_posted_projects         0
project_is_approved                                  0
dtype: int64

In [0]:
project_data= project_data[project_data["teacher_prefix"].notnull()]

In [10]:
project_data.isnull().sum()

Unnamed: 0                                           0
id                                                   0
teacher_id                                           0
teacher_prefix                                       0
school_state                                         0
project_submitted_datetime                           0
project_grade_category                               0
project_subject_categories                           0
project_subject_subcategories                        0
project_title                                        0
project_essay_1                                      0
project_essay_2                                      0
project_essay_3                                 105488
project_essay_4                                 105488
project_resource_summary                             0
teacher_number_of_previously_posted_projects         0
project_is_approved                                  0
dtype: int64

## 1.2 Preprocessing of `project_subject_categories`

In [0]:
catogories = list(project_data['project_subject_categories'].values)
# remove special characters from list of strings python: https://stackoverflow.com/a/47301924/4084039

# https://www.geeksforgeeks.org/removing-stop-words-nltk-python/
# https://stackoverflow.com/questions/23669024/how-to-strip-a-specific-word-from-a-string
# https://stackoverflow.com/questions/8270092/remove-all-whitespace-in-a-string-in-python
cat_list = []
for i in catogories:
    temp = ""
    # Example input: "Math & Science, Warmth, Care & Hunger"
    for j in i.split(','): # split into three parts: ["Math & Science", " Warmth", " Care & Hunger"]
        if 'The' in j.split(): # split the category on spaces, e.g. "Music & The Arts" => "Music", "&", "The", "Arts"
            j=j.replace('The','') # drop the word "The" from the category
        j = j.replace(' ','') # remove all spaces, e.g. "Math & Science" => "Math&Science"
        temp+=j.strip()+" " # strip leftover whitespace and separate categories with a single space
        temp = temp.replace('&','_') # replace '&' with '_', e.g. "Math&Science" => "Math_Science"
    cat_list.append(temp.strip())
    
project_data['clean_categories'] = cat_list
project_data.drop(['project_subject_categories'], axis=1, inplace=True)

from collections import Counter
my_counter = Counter()
for word in project_data['clean_categories'].values:
    my_counter.update(word.split())

cat_dict = dict(my_counter)
sorted_cat_dict = dict(sorted(cat_dict.items(), key=lambda kv: kv[1]))
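
The cleaning loop above can also be read as a small function applied to each value. The sketch below mirrors the same steps; `clean_category` is a hypothetical helper name and the three input strings are illustrative:

```python
from collections import Counter

def clean_category(raw):
    """Turn 'Literacy & Language, Math & Science' into 'Literacy_Language Math_Science'."""
    tokens = []
    for part in raw.split(','):
        if 'The' in part.split():          # drop the standalone word "The"
            part = part.replace('The', '')
        # remove spaces inside a category, then replace '&' with '_'
        tokens.append(part.replace(' ', '').replace('&', '_'))
    return ' '.join(tokens).strip()

raw_values = ["Music & The Arts", "Literacy & Language, Math & Science", "Math & Science"]
cleaned = [clean_category(value) for value in raw_values]

# Count how often each cleaned category token occurs.
category_counts = Counter()
for value in cleaned:
    category_counts.update(value.split())
print(cleaned)
print(category_counts)
```

Each project ends up with a space-separated string of single-token categories, which is what makes the later `split()`-based counting work.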


## 1.3 Preprocessing of `project_subject_subcategories`

In [0]:
sub_catogories = list(project_data['project_subject_subcategories'].values)
# remove special characters from list of strings python: https://stackoverflow.com/a/47301924/4084039

# https://www.geeksforgeeks.org/removing-stop-words-nltk-python/
# https://stackoverflow.com/questions/23669024/how-to-strip-a-specific-word-from-a-string
# https://stackoverflow.com/questions/8270092/remove-all-whitespace-in-a-string-in-python

sub_cat_list = []
for i in sub_catogories:
    temp = ""
    # Example input: "Literature & Writing, Social Sciences"
    for j in i.split(','): # split into two parts: ["Literature & Writing", " Social Sciences"]
        if 'The' in j.split(): # split the subcategory on spaces and check for the word "The"
            j=j.replace('The','') # drop the word "The" from the subcategory
        j = j.replace(' ','') # remove all spaces, e.g. "Literature & Writing" => "Literature&Writing"
        temp +=j.strip()+" " # strip leftover whitespace and separate subcategories with a single space
        temp = temp.replace('&','_') # replace '&' with '_', e.g. "Literature&Writing" => "Literature_Writing"
    sub_cat_list.append(temp.strip())

project_data['clean_subcategories'] = sub_cat_list
project_data.drop(['project_subject_subcategories'], axis=1, inplace=True)

# count of all the words in corpus python: https://stackoverflow.com/a/22898595/4084039
my_counter = Counter()
for word in project_data['clean_subcategories'].values:
    my_counter.update(word.split())
    
sub_cat_dict = dict(my_counter)
sorted_sub_cat_dict = dict(sorted(sub_cat_dict.items(), key=lambda kv: kv[1]))

## 1.3 Text preprocessing

In [0]:
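# merge the four essay columns into a single text column
# essays 3 and 4 are NaN for most rows; fill them with '' so the string "nan" is not appended
project_data["essay"] = project_data["project_essay_1"].fillna('').map(str) +\
                        project_data["project_essay_2"].fillna('').map(str) + \
                        project_data["project_essay_3"].fillna('').map(str) + \
                        project_data["project_essay_4"].fillna('').map(str)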

In [14]:
project_data.head(2)

Unnamed: 0.1,Unnamed: 0,id,teacher_id,teacher_prefix,school_state,project_submitted_datetime,project_grade_category,project_title,project_essay_1,project_essay_2,project_essay_3,project_essay_4,project_resource_summary,teacher_number_of_previously_posted_projects,project_is_approved,clean_categories,clean_subcategories,essay
0,160221,p253737,c90749f5d961ff158d4b4d1e7dc665fc,Mrs.,IN,2016-12-05 13:43:57,Grades PreK-2,Educational Support for English Learners at Home,My students are English learners that are work...,"\""The limits of your language are the limits o...",,,My students need opportunities to practice beg...,0,0,Literacy_Language,ESL Literacy,My students are English learners that are work...
1,140945,p258326,897464ce9ddc600bced1151f324dd63a,Mr.,FL,2016-10-25 09:22:10,Grades 6-8,Wanted: Projector for Hungry Learners,Our students arrive to our school eager to lea...,The projector we need for our school is very c...,,,My students need a projector to help with view...,7,1,History_Civics Health_Sports,Civics_Government TeamSports,Our students arrive to our school eager to lea...


In [0]:
# https://stackoverflow.com/a/47091490/4084039
import re

def decontracted(phrase):
    # specific
    phrase = re.sub(r"won't", "will not", phrase)
    phrase = re.sub(r"can\'t", "can not", phrase)

    # general
    phrase = re.sub(r"n\'t", " not", phrase)
    phrase = re.sub(r"\'re", " are", phrase)
    phrase = re.sub(r"\'s", " is", phrase)  # caution: also rewrites possessives, e.g. "children's" -> "children is"
    phrase = re.sub(r"\'d", " would", phrase)
    phrase = re.sub(r"\'ll", " will", phrase)
    phrase = re.sub(r"\'t", " not", phrase)
    phrase = re.sub(r"\'ve", " have", phrase)
    phrase = re.sub(r"\'m", " am", phrase)
    return phrase

In [16]:
sent = decontracted(project_data['essay'].values[20000])
print(sent)
print("="*50)

My wonderful students are 3, 4, and 5 years old.  We are located in a small town outside of Charlotte, NC.  All of my 22 students are children of school district employees.\r\nMy students are bright, energetic, and they love to learn!  They love hands-on activities that get them moving.  Like most preschoolers, they enjoy music and creating different things. \r\nAll of my students come from wonderful families that are very supportive of our classroom.  Our parents enjoy watching their children is growth as much as we do!These materials will help me teach my students all about the life cycle of a butterfly.  We will watch as the Painted Lady caterpillars grow bigger and build their chrysalis.  After a few weeks they will emerge from the chrysalis as beautiful butterflies!  We already have a net for the chrysalises, but we still need the caterpillars and feeding station.\r\nThis will be an unforgettable experience for my students.  My student absolutely love hands-on materials.  They lea

In [17]:
# \r \n \t remove from string python: http://texthandler.com/info/remove-line-breaks-python/
sent = sent.replace('\\r', ' ')
sent = sent.replace('\\"', ' ')
sent = sent.replace('\\n', ' ')
print(sent)

My wonderful students are 3, 4, and 5 years old.  We are located in a small town outside of Charlotte, NC.  All of my 22 students are children of school district employees.  My students are bright, energetic, and they love to learn!  They love hands-on activities that get them moving.  Like most preschoolers, they enjoy music and creating different things.   All of my students come from wonderful families that are very supportive of our classroom.  Our parents enjoy watching their children is growth as much as we do!These materials will help me teach my students all about the life cycle of a butterfly.  We will watch as the Painted Lady caterpillars grow bigger and build their chrysalis.  After a few weeks they will emerge from the chrysalis as beautiful butterflies!  We already have a net for the chrysalises, but we still need the caterpillars and feeding station.  This will be an unforgettable experience for my students.  My student absolutely love hands-on materials.  They learn so 

In [18]:
# remove special characters: https://stackoverflow.com/a/5843547/4084039
sent = re.sub('[^A-Za-z0-9]+', ' ', sent)
print(sent)

My wonderful students are 3 4 and 5 years old We are located in a small town outside of Charlotte NC All of my 22 students are children of school district employees My students are bright energetic and they love to learn They love hands on activities that get them moving Like most preschoolers they enjoy music and creating different things All of my students come from wonderful families that are very supportive of our classroom Our parents enjoy watching their children is growth as much as we do These materials will help me teach my students all about the life cycle of a butterfly We will watch as the Painted Lady caterpillars grow bigger and build their chrysalis After a few weeks they will emerge from the chrysalis as beautiful butterflies We already have a net for the chrysalises but we still need the caterpillars and feeding station This will be an unforgettable experience for my students My student absolutely love hands on materials They learn so much from getting to touch and man

In [0]:
# https://gist.github.com/sebleier/554280
# we are removing the words from the stop words list: 'no', 'nor', 'not'
stopwords= ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've",\
            "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', \
            'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their',\
            'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', \
            'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', \
            'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', \
            'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after',\
            'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further',\
            'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more',\
            'most', 'other', 'some', 'such', 'only', 'own', 'same', 'so', 'than', 'too', 'very', \
            's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', \
            've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn',\
            "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn',\
            "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", \
            'won', "won't", 'wouldn', "wouldn't"]

In [20]:
# Apply all of the cleaning steps above to every essay
from tqdm import tqdm
preprocessed_essays = []
# tqdm displays a progress bar
for sentance in tqdm(project_data['essay'].values):
    sent = decontracted(sentance)
    sent = sent.replace('\\r', ' ')
    sent = sent.replace('\\"', ' ')
    sent = sent.replace('\\n', ' ')
    sent = re.sub('[^A-Za-z0-9]+', ' ', sent)
    # collapse repeated whitespace; stop words are kept for the essays
    sent = ' '.join(e for e in sent.split())
    preprocessed_essays.append(sent.lower().strip())

100%|██████████| 109245/109245 [00:12<00:00, 8884.03it/s]
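
The per-essay steps can be collected into one reusable function. This is a sketch rather than the notebook's exact code: `preprocess_text` is a hypothetical name, only a few of the contraction rules are repeated, and stop word removal is optional:

```python
import re

def preprocess_text(text, stop_words=frozenset()):
    """Expand a few contractions, drop escaped line breaks and punctuation, lowercase."""
    text = re.sub(r"won't", "will not", text)
    text = re.sub(r"can't", "can not", text)
    text = re.sub(r"n't", " not", text)
    # the raw data stores line breaks as literal backslash sequences
    text = text.replace('\\r', ' ').replace('\\n', ' ').replace('\\"', ' ')
    text = re.sub('[^A-Za-z0-9]+', ' ', text)
    # lowercase before filtering so that an all-lowercase stop word list matches
    words = [w for w in text.lower().split() if w not in stop_words]
    return ' '.join(words)

print(preprocess_text("We can't wait!\\r\\nThe kids won't stop."))
print(preprocess_text("The Best Books", stop_words={"the"}))
```

Passing an empty `stop_words` set keeps every word, which matches how the essays are handled above.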


## 1.4 Preprocessing of `project_title`

In [21]:
# preprocess the titles the same way, additionally removing stop words
preprocessed_titles = []
for sentance in tqdm(project_data['project_title'].values):
    sent = decontracted(sentance)
    sent = re.sub('[^A-Za-z0-9]+', ' ', sent)
    # stop words: https://gist.github.com/sebleier/554280
    # lowercase before filtering, because the stop word list is all lowercase
    sent = ' '.join(e for e in sent.lower().split() if e not in stopwords)
    preprocessed_titles.append(sent.strip())

100%|██████████| 109245/109245 [00:01<00:00, 55769.39it/s]


In [22]:
project_data.columns

Index(['Unnamed: 0', 'id', 'teacher_id', 'teacher_prefix', 'school_state',
       'project_submitted_datetime', 'project_grade_category', 'project_title',
       'project_essay_1', 'project_essay_2', 'project_essay_3',
       'project_essay_4', 'project_resource_summary',
       'teacher_number_of_previously_posted_projects', 'project_is_approved',
       'clean_categories', 'clean_subcategories', 'essay'],
      dtype='object')

We are going to consider the following features:

       - school_state : categorical data
       - clean_categories : categorical data
       - clean_subcategories : categorical data
       - project_grade_category : categorical data
       - teacher_prefix : categorical data
       
       - project_title : text data
       - essay : text data
       - project_resource_summary : text data (optional)
       
       - quantity : numerical (optional)
       - teacher_number_of_previously_posted_projects : numerical
       - price : numerical

In [0]:
price_data = resource_data.groupby('id').agg({'price':'sum', 'quantity':'sum'}).reset_index()
project_data = pd.merge(project_data, price_data, on='id', how='left')
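
The aggregation and left merge above can be checked on a tiny example. The rows below are invented; the `line_total` variant is an optional extra that assumes `price` is a per-unit price, which the data description does not state explicitly:

```python
import pandas as pd

resources = pd.DataFrame({
    "id": ["p1", "p1", "p2"],
    "quantity": [3, 1, 2],
    "price": [10.0, 25.0, 5.0],
})
projects = pd.DataFrame({"id": ["p1", "p2", "p3"]})

# Same aggregation as the cell above: one row per project id.
per_project = resources.groupby("id").agg({"price": "sum", "quantity": "sum"}).reset_index()

# A left merge keeps every project; projects without resources get NaN.
merged = pd.merge(projects, per_project, on="id", how="left")
print(merged)

# Optional variant (assumes unit prices): total cost per project.
resources["line_total"] = resources["price"] * resources["quantity"]
total_cost = resources.groupby("id")["line_total"].sum()
print(total_cost)
```

Note that summing `price` alone adds up the listed prices and ignores how many units of each item were requested.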

In [25]:
# check this one: https://www.youtube.com/watch?v=0HOqOcln3Z4&t=530s
# standardization sklearn: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
from sklearn.preprocessing import StandardScaler

# price_standardized = standardScalar.fit(project_data['price'].values)
# this will raise the error
# ValueError: Expected 2D array, got 1D array instead: array=[725.05 213.03 329.   ... 399.   287.73   5.5 ].
# Reshape your data either using array.reshape(-1, 1)

price_scalar = StandardScaler()
price_scalar.fit(project_data['price'].values.reshape(-1,1)) # finding the mean and standard deviation of this data
print(f"Mean : {price_scalar.mean_[0]}, Standard deviation : {np.sqrt(price_scalar.var_[0])}")

# Now standardize the data with the above mean and standard deviation.
price_standardized = price_scalar.transform(project_data['price'].values.reshape(-1, 1))

Mean : 298.1152448166964, Standard deviation : 367.49642545627506


In [26]:
price_standardized

array([[-0.39052147],
       [ 0.00240752],
       [ 0.5952024 ],
       ...,
       [-0.1582471 ],
       [-0.61242839],
       [-0.51215531]])
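
Standardization maps each value x to z = (x - mean) / std, where the mean and standard deviation come from the fitted data. A quick sketch with illustrative prices shows that `StandardScaler` agrees with the manual formula:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

prices = np.array([149.0, 14.95, 9.95, 329.0])  # illustrative values

scaler = StandardScaler().fit(prices.reshape(-1, 1))
scaled = scaler.transform(prices.reshape(-1, 1)).ravel()

# Manual z-score; np.std uses the population formula (ddof=0) by default,
# which is the same convention StandardScaler uses.
manual = (prices - prices.mean()) / prices.std()

print(scaled)
print(manual)
```

After the transform the column has mean 0 and standard deviation 1 on the data it was fitted on.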

In [0]:
cleaned_data=project_data.copy()
# Adding preprocessed_essays and titles to the dataframe
cleaned_data['cleaned_essay']=preprocessed_essays
cleaned_data['cleaned_titles']=preprocessed_titles
cleaned_data.drop(['project_title','project_essay_1','project_essay_2','project_essay_3','project_essay_4'],axis=1,inplace=True)
y=cleaned_data['project_is_approved']
cleaned_data.drop(['project_is_approved'],axis=1, inplace=True)
x=cleaned_data

In [0]:
from sklearn.model_selection import train_test_split
x_train1,x_test,y_train1,y_test=train_test_split(x,y,test_size=0.2,stratify=y)
x_train,x_cross,y_train,y_cv=train_test_split(x_train1,y_train1,test_size=0.2,stratify=y_train1)

In [29]:
print(x_train.shape, y_train.shape)
print(x_cross.shape, y_cv.shape)
print(x_test.shape, y_test.shape)

(69916, 16) (69916,)
(17480, 16) (17480,)
(21849, 16) (21849,)
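
The shapes above follow from two successive 20% splits: 20% of 109245 rows is 21849 test rows, and 20% of the remaining 87396 rows, rounded up, is 17480 cross-validation rows. The sketch below repeats the same two-step stratified split on synthetic labels; the `random_state` value and the `_demo` names are assumptions added for reproducibility and are not part of the notebook:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(1000).reshape(-1, 1)
y_demo = np.array([1] * 850 + [0] * 150)   # roughly imbalanced labels, 85% positive

# First split off the test set, then carve a cross-validation set out of the rest.
X_rest, X_test_demo, y_rest, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42)
X_train_demo, X_cv_demo, y_train_demo, y_cv_demo = train_test_split(
    X_rest, y_rest, test_size=0.2, stratify=y_rest, random_state=42)

print(len(X_train_demo), len(X_cv_demo), len(X_test_demo))
print(y_train_demo.mean(), y_cv_demo.mean(), y_test_demo.mean())
```

With `stratify`, each split keeps approximately the same share of approved projects as the full data set.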


In [30]:
y_train.values

array([1, 1, 1, ..., 1, 1, 1])

In [0]:
#https://stackoverflow.com/questions/21057621/sklearn-labelencoder-with-never-seen-before-values

class LabelEncoderExt(object):
    def __init__(self):
        """
        It differs from LabelEncoder by handling new classes and providing a value for it [Unknown]
        Unknown will be added in fit and transform will take care of new item. It gives unknown class id
        """
        self.label_encoder = LabelEncoder()
        # self.classes_ = self.label_encoder.classes_

    def fit(self, data_list):
        """
        This will fit the encoder for all the unique values and introduce unknown value
        :param data_list: A list of string
        :return: self
        """
        self.label_encoder = self.label_encoder.fit(list(data_list) + ['Unknown'])
        self.classes_ = self.label_encoder.classes_

        return self

    def transform(self, data_list):
        """
        This will transform the data_list to id list where the new values get assigned to Unknown class
        :param data_list:
        :return:
        """
        new_data_list = list(data_list)
        for unique_item in np.unique(data_list):
            if unique_item not in self.label_encoder.classes_:
                new_data_list = ['Unknown' if x==unique_item else x for x in new_data_list]

        return self.label_encoder.transform(new_data_list)
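
The idea behind the class above, reserving an `Unknown` label for values never seen during fitting, can be shown in a compact functional form. `fit_with_unknown` and `transform_with_unknown` are hypothetical helper names used only for this sketch:

```python
from sklearn.preprocessing import LabelEncoder

def fit_with_unknown(values):
    """Fit a LabelEncoder on the given values plus a reserved 'Unknown' class."""
    return LabelEncoder().fit(list(values) + ['Unknown'])

def transform_with_unknown(encoder, values):
    """Encode values, mapping anything unseen during fit to 'Unknown'."""
    known = set(encoder.classes_)
    mapped = [v if v in known else 'Unknown' for v in values]
    return encoder.transform(mapped)

encoder = fit_with_unknown(['Mrs.', 'Mr.', 'Ms.', 'Dr.'])
codes = transform_with_unknown(encoder, ['Mr.', 'Teacher'])
print(list(encoder.classes_))
print(codes)
```

A plain `LabelEncoder` would raise an error on `'Teacher'` here, which is the situation the extended encoder is meant to avoid on the cross-validation and test splits.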


In [0]:
x_train['teacher_prefix'].fillna(value="Mrs.", inplace=True)
x_cross['teacher_prefix'].fillna(value="Mrs.", inplace=True)
x_test['teacher_prefix'].fillna(value="Mrs.", inplace=True)
vectorizer = LabelEncoderExt()
vectorizer.fit(x_train['teacher_prefix'].values)
x_train_teacher_prefix_ohe = vectorizer.transform(x_train['teacher_prefix'].values)
x_cv_teacher_prefix_ohe = vectorizer.transform(x_cross['teacher_prefix'].values)
x_test_teacher_prefix_ohe = vectorizer.transform(x_test['teacher_prefix'].values)

vectorizer = LabelEncoderExt()
vectorizer.fit(x_train['school_state'].values)
x_train_state_ohe = vectorizer.transform(x_train['school_state'].values)
x_cv_state_ohe = vectorizer.transform(x_cross['school_state'].values)
x_test_state_ohe = vectorizer.transform(x_test['school_state'].values)

vectorizer = LabelEncoderExt()
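# fit on the actual column values (e.g. "Grades PreK-2"); a hard-coded list in a different
# spelling would map every row to 'Unknown'
vectorizer.fit(x_train['project_grade_category'].values)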
x_train_grade_ohe = vectorizer.transform(x_train['project_grade_category'].values)
x_cv_grade_ohe = vectorizer.transform(x_cross['project_grade_category'].values)
x_test_grade_ohe = vectorizer.transform(x_test['project_grade_category'].values)

vectorizer = LabelEncoderExt()
vectorizer.fit(x_train['clean_categories'].values)
x_train_cat_ohe = vectorizer.transform(x_train['clean_categories'].values)
x_cv_cat_ohe = vectorizer.transform(x_cross['clean_categories'].values)
x_test_cat_ohe = vectorizer.transform(x_test['clean_categories'].values)

vectorizer = LabelEncoderExt()
vectorizer.fit(x_train['clean_subcategories'].values)
x_train_scat_ohe = vectorizer.transform(x_train['clean_subcategories'].values)
x_cv_scat_ohe = vectorizer.transform(x_cross['clean_subcategories'].values)
x_test_scat_ohe = vectorizer.transform(x_test['clean_subcategories'].values)


In [0]:
#https://medium.com/@davidheffernan_99410/an-introduction-to-using-categorical-embeddings-ee686ed7e7f9
cat_vars = ["teacher_prefix","school_state","project_grade_category","clean_categories","clean_subcategories"]
cat_sizes = {}
cat_embsizes = {}
for cat in cat_vars:
    cat_sizes[cat] = x_train[cat].nunique()
    cat_embsizes[cat] = min(50, cat_sizes[cat]//2+1)
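
The embedding-size rule used above is half the number of categories plus one, capped at 50. A worked sketch with assumed cardinalities (the counts below are illustrative, not measured from the data, and `embedding_size` is a hypothetical helper):

```python
def embedding_size(n_categories, cap=50):
    """Rule of thumb from the cell above: n // 2 + 1, capped at `cap`."""
    return min(cap, n_categories // 2 + 1)

# Illustrative cardinalities only.
example_sizes = {
    "teacher_prefix": 5,
    "project_grade_category": 4,
    "school_state": 51,
    "large_feature": 400,
}
example_embsizes = {name: embedding_size(n) for name, n in example_sizes.items()}
print(example_embsizes)
```

One thing to keep in mind: because `LabelEncoderExt` adds an extra `Unknown` class, the encoded ids can range one index beyond what `nunique()` reports, so an embedding layer fed with those ids likely needs room for that additional index.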

In [0]:
from keras.layers import Reshape

In [34]:
from sklearn.preprocessing import Normalizer
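# Caution: Normalizer is stateless and rescales each row to unit norm. With reshape(1, -1)
# the whole column is one row, so train, CV and test are each divided by their own L2 norm.
normalizer = Normalizer()
normalizer.fit(x_train['price'].values.reshape(1,-1))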

x_train_price_norm = normalizer.transform(x_train['price'].values.reshape(1,-1)).T
x_cv_price_norm = normalizer.transform(x_cross['price'].values.reshape(1,-1)).T
x_test_price_norm = normalizer.transform(x_test['price'].values.reshape(1,-1)).T

print("After normalizing price")
print(x_train_price_norm.shape, y_train.shape)
print(x_cv_price_norm.shape, y_cv.shape)
print(x_test_price_norm.shape, y_test.shape)

print("========================================================")
normalizer = Normalizer()
normalizer.fit(x_train['quantity'].values.reshape(1,-1))

x_train_qty_norm = normalizer.transform(x_train['quantity'].values.reshape(1,-1)).T
x_cv_qty_norm = normalizer.transform(x_cross['quantity'].values.reshape(1,-1)).T
x_test_qty_norm = normalizer.transform(x_test['quantity'].values.reshape(1,-1)).T
print("After normalizing the quantity")
print(x_train_qty_norm.shape, y_train.shape)
print(x_cv_qty_norm.shape, y_cv.shape)
print(x_test_qty_norm.shape, y_test.shape)
print("========================================================")

normalizer = Normalizer()
normalizer.fit(x_train['teacher_number_of_previously_posted_projects'].values.reshape(1,-1))

x_train_tpp_norm = normalizer.transform(x_train['teacher_number_of_previously_posted_projects'].values.reshape(1,-1)).T
x_cv_tpp_norm = normalizer.transform(x_cross['teacher_number_of_previously_posted_projects'].values.reshape(1,-1)).T
x_test_tpp_norm = normalizer.transform(x_test['teacher_number_of_previously_posted_projects'].values.reshape(1,-1)).T
print("After normalizing the teacher_number_of_previously_posted_projects")
print(x_train_tpp_norm.shape, y_train.shape)
print(x_cv_tpp_norm.shape, y_cv.shape)
print(x_test_tpp_norm.shape, y_test.shape)

print("========================================================")



After normalizing price
(69916, 1) (69916,)
(17480, 1) (17480,)
(21849, 1) (21849,)
After normalizing the quantity
(69916, 1) (69916,)
(17480, 1) (17480,)
(21849, 1) (21849,)
After normalizing the teacher_number_of_previously_posted_projects
(69916, 1) (69916,)
(17480, 1) (17480,)
(21849, 1) (21849,)


### Assignment 1

In [0]:

from tensorflow.keras import *
from tensorboardcolab import *
from tensorflow.keras.regularizers import l2

In [37]:
from tensorflow.keras import backend as K
K.clear_session()






In [38]:
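# Fill missing teacher prefixes with "Mrs."; assigning the result back avoids in-place edits on a column view.
x_train['teacher_prefix'] = x_train['teacher_prefix'].fillna("Mrs.")
x_cross['teacher_prefix'] = x_cross['teacher_prefix'].fillna("Mrs.")
x_test['teacher_prefix'] = x_test['teacher_prefix'].fillna("Mrs.")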
vectorizer = CountVectorizer()
vectorizer.fit(x_train['teacher_prefix'].values)
x_train_teacher_prefix_ohe = vectorizer.transform(x_train['teacher_prefix'].values)
x_cv_teacher_prefix_ohe = vectorizer.transform(x_cross['teacher_prefix'].values)
x_test_teacher_prefix_ohe = vectorizer.transform(x_test['teacher_prefix'].values)
teacher_prefix_f=vectorizer.get_feature_names()
print("After vectorizations of teacher_prefix")
print(x_train_teacher_prefix_ohe.shape, y_train.shape)
print(x_cv_teacher_prefix_ohe.shape, y_cv.shape)
print(x_test_teacher_prefix_ohe.shape, y_test.shape)
print(vectorizer.get_feature_names())

print("========================================================")

vectorizer = CountVectorizer()
vectorizer.fit(x_train['school_state'].values)
x_train_state_ohe = vectorizer.transform(x_train['school_state'].values)
x_cv_state_ohe = vectorizer.transform(x_cross['school_state'].values)
x_test_state_ohe = vectorizer.transform(x_test['school_state'].values)
state_f=vectorizer.get_feature_names()
print("After vectorizations of school_state")
print(x_train_state_ohe.shape, y_train.shape)
print(x_cv_state_ohe.shape, y_cv.shape)
print(x_test_state_ohe.shape, y_test.shape)
print(vectorizer.get_feature_names())
print("========================================================")
vectorizer = CountVectorizer()
vectorizer.fit(['grades_3_5', 'grades_6_8', 'grades_9_12', 'grades_prek_2'])
x_train_grade_ohe = vectorizer.transform(x_train['project_grade_category'].values)
x_cv_grade_ohe = vectorizer.transform(x_cross['project_grade_category'].values)
x_test_grade_ohe = vectorizer.transform(x_test['project_grade_category'].values)
teacher_grade_f=vectorizer.get_feature_names()
print("After vectorizations of project_grade_category")
print(x_train_grade_ohe.shape, y_train.shape)
print(x_cv_grade_ohe.shape, y_cv.shape)
print(x_test_grade_ohe.shape, y_test.shape)
print(vectorizer.get_feature_names())

print("========================================================")

vectorizer = CountVectorizer()
vectorizer.fit(x_train['clean_categories'].values)
x_train_cat_ohe = vectorizer.transform(x_train['clean_categories'].values)
x_cv_cat_ohe = vectorizer.transform(x_cross['clean_categories'].values)
x_test_cat_ohe = vectorizer.transform(x_test['clean_categories'].values)
teacher_cat_f=vectorizer.get_feature_names()
print("After vectorizations of clean_categories")
print(x_train_cat_ohe.shape, y_train.shape)
print(x_cv_cat_ohe.shape, y_cv.shape)
print(x_test_cat_ohe.shape, y_test.shape)
print(vectorizer.get_feature_names())

print("========================================================")

vectorizer = CountVectorizer()
vectorizer.fit(x_train['clean_subcategories'].values)
x_train_scat_ohe = vectorizer.transform(x_train['clean_subcategories'].values)
x_cv_scat_ohe = vectorizer.transform(x_cross['clean_subcategories'].values)
x_test_scat_ohe = vectorizer.transform(x_test['clean_subcategories'].values)
teacher_scat_f=vectorizer.get_feature_names()
print("After vectorizations of clean_subcategories ")
print(x_train_scat_ohe.shape, y_train.shape)
print(x_cv_scat_ohe.shape, y_cv.shape)
print(x_test_scat_ohe.shape, y_test.shape)
print(vectorizer.get_feature_names())

print("========================================================")

After vectorizations of teacher_prefix
(69916, 5) (69916,)
(17480, 5) (17480,)
(21849, 5) (21849,)
['dr', 'mr', 'mrs', 'ms', 'teacher']
After vectorizations of school_state
(69916, 51) (69916,)
(17480, 51) (17480,)
(21849, 51) (21849,)
['ak', 'al', 'ar', 'az', 'ca', 'co', 'ct', 'dc', 'de', 'fl', 'ga', 'hi', 'ia', 'id', 'il', 'in', 'ks', 'ky', 'la', 'ma', 'md', 'me', 'mi', 'mn', 'mo', 'ms', 'mt', 'nc', 'nd', 'ne', 'nh', 'nj', 'nm', 'nv', 'ny', 'oh', 'ok', 'or', 'pa', 'ri', 'sc', 'sd', 'tn', 'tx', 'ut', 'va', 'vt', 'wa', 'wi', 'wv', 'wy']
After vectorizations of project_grade_category
(69916, 4) (69916,)
(17480, 4) (17480,)
(21849, 4) (21849,)
['grades_3_5', 'grades_6_8', 'grades_9_12', 'grades_prek_2']
After vectorizations of clean_categories
(69916, 9) (69916,)
(17480, 9) (17480,)
(21849, 9) (21849,)
['appliedlearning', 'care_hunger', 'health_sports', 'history_civics', 'literacy_language', 'math_science', 'music_arts', 'specialneeds', 'warmth']
After vectorizations of clean_subcategori

In [39]:
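# Note: Normalizer rescales each row (sample) to unit norm. With reshape(-1,1) every row
# holds a single value, so each non-zero entry becomes 1.0. The earlier cell (In [34])
# used reshape(1,-1) followed by .T, which divides the whole column by its L2 norm.
normalizer = Normalizer()
normalizer.fit(x_train['price'].values.reshape(-1,1))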

x_train_price_norm = normalizer.transform(x_train['price'].values.reshape(-1,1))
x_cv_price_norm = normalizer.transform(x_cross['price'].values.reshape(-1,1))
x_test_price_norm = normalizer.transform(x_test['price'].values.reshape(-1,1))

print("After normalizing price")
print(x_train_price_norm.shape, y_train.shape)
print(x_cv_price_norm.shape, y_cv.shape)
print(x_test_price_norm.shape, y_test.shape)

print("========================================================")

normalizer = Normalizer()
normalizer.fit(x_train['quantity'].values.reshape(-1,1))

x_train_qty_norm = normalizer.transform(x_train['quantity'].values.reshape(-1,1))
x_cv_qty_norm = normalizer.transform(x_cross['quantity'].values.reshape(-1,1))
x_test_qty_norm = normalizer.transform(x_test['quantity'].values.reshape(-1,1))
print("After normalizing the quantity")
print(x_train_qty_norm.shape, y_train.shape)
print(x_cv_qty_norm.shape, y_cv.shape)
print(x_test_qty_norm.shape, y_test.shape)
print("========================================================")

normalizer = Normalizer()
normalizer.fit(x_train['teacher_number_of_previously_posted_projects'].values.reshape(-1,1))

x_train_tpp_norm = normalizer.transform(x_train['teacher_number_of_previously_posted_projects'].values.reshape(-1,1))
x_cv_tpp_norm = normalizer.transform(x_cross['teacher_number_of_previously_posted_projects'].values.reshape(-1,1))
x_test_tpp_norm = normalizer.transform(x_test['teacher_number_of_previously_posted_projects'].values.reshape(-1,1))
print("After normalizing the teacher_number_of_previously_posted_projects")
print(x_train_tpp_norm.shape, y_train.shape)
print(x_cv_tpp_norm.shape, y_cv.shape)
print(x_test_tpp_norm.shape, y_test.shape)

After normalizing price
(69916, 1) (69916,)
(17480, 1) (17480,)
(21849, 1) (21849,)
After normalizing the quantity
(69916, 1) (69916,)
(17480, 1) (17480,)
(21849, 1) (21849,)
After normalizing the teacher_number_of_previously_posted_projects
(69916, 1) (69916,)
(17480, 1) (17480,)
(21849, 1) (21849,)
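
The two normalization cells in this notebook reshape the same columns differently (`reshape(1,-1)` followed by `.T` in one, `reshape(-1,1)` in the other). The shapes printed are identical, but the values are not. The sketch below mimics row-wise L2 normalization with plain NumPy to show the difference; `l2_normalize_rows` is a hypothetical stand-in for what `Normalizer` does per row, and the prices are made up.

```python
import numpy as np

def l2_normalize_rows(matrix):
    """Scale each row to unit L2 norm; all-zero rows are left unchanged."""
    matrix = np.asarray(matrix, dtype=float)
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for empty rows
    return matrix / norms

prices = np.array([10.0, 250.0, 0.0, 40.0])  # hypothetical values

# One value per row: every non-zero entry is divided by itself.
as_column = l2_normalize_rows(prices.reshape(-1, 1))

# Whole column treated as a single row, then transposed back to a column.
as_row = l2_normalize_rows(prices.reshape(1, -1)).T

print(as_column.ravel())
print(as_row.ravel())
```

In the first variant the feature carries no magnitude information (only zero versus non-zero), while the second keeps the relative sizes of the values.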


In [40]:
x_tr_rem = hstack((x_train_state_ohe, x_train_teacher_prefix_ohe, x_train_grade_ohe,x_train_scat_ohe,x_train_cat_ohe, x_train_price_norm,x_train_qty_norm,x_train_tpp_norm)).todense()
x_cv_rem = hstack(( x_cv_state_ohe, x_cv_teacher_prefix_ohe, x_cv_grade_ohe,x_cv_scat_ohe,x_cv_cat_ohe, x_cv_price_norm,x_cv_qty_norm,x_cv_tpp_norm)).todense()
x_te_rem = hstack((x_test_state_ohe, x_test_teacher_prefix_ohe, x_test_grade_ohe,x_test_scat_ohe,x_test_cat_ohe, x_test_price_norm,x_test_qty_norm,x_test_tpp_norm)).todense()
print("Final Data matrix")
print(x_tr_rem.shape, y_train.shape)
print(x_cv_rem.shape, y_cv.shape)
print(x_te_rem.shape, y_test.shape)
print("="*100)

Final Data matrix
(69916, 102) (69916,)
(17480, 102) (17480,)
(21849, 102) (21849,)


In [0]:
mms = StandardScaler().fit(x_tr_rem)
x_tr_rem_norm = mms.transform(x_tr_rem)
x_cv_rem_norm = mms.transform(x_cv_rem)
x_te_rem_norm = mms.transform(x_te_rem)

In [0]:
x_tr_rem_norm.shape

(69916, 102)

In [0]:
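# Add a trailing channel axis so Conv1D receives (samples, features, 1).
# Note: this uses the unscaled x_*_rem matrices, not the StandardScaler outputs (x_*_rem_norm).
x_tr_rem_reshape = np.asarray(x_tr_rem).reshape(x_tr_rem.shape[0], x_tr_rem.shape[1], 1)
x_cv_rem_reshape = np.asarray(x_cv_rem).reshape(x_cv_rem.shape[0], x_cv_rem.shape[1], 1)
x_test_rem_reshape = np.asarray(x_te_rem).reshape(x_te_rem.shape[0], x_te_rem.shape[1], 1)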

In [0]:
x_tr_rem_reshape.shape

(69916, 102, 1)
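
`Conv1D` expects a three-dimensional input of the form (samples, steps, channels), which is why the feature matrix gets a trailing axis of size 1. A small sketch on a made-up array shows the reshape and an equivalent indexing form; the array contents are assumptions for illustration only.

```python
import numpy as np

# Hypothetical (samples, features) matrix standing in for the stacked features.
features = np.arange(12, dtype=float).reshape(4, 3)

# Derive the target shape from the data instead of hard-coding row counts.
as_sequence = features.reshape(features.shape[0], features.shape[1], 1)

# Equivalent: append a new axis with indexing.
also_sequence = features[..., np.newaxis]

print(as_sequence.shape)
```

`np.matrix` objects, such as those returned by `.todense()`, are always two-dimensional, which is presumably why the cell above converts to a regular array before reshaping.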

In [0]:
max_length=400

In [0]:
#https://machinelearningmastery.com/use-word-embedding-layers-deep-learning-keras/
def padded(encoded_docs):  
  max_length = 400
  padded_docs = pad_sequences(encoded_docs, maxlen=max_length, padding='post')
  return padded_docs
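
With `padding='post'` and the default `truncating='pre'`, `pad_sequences` appends zeros to short sequences and drops tokens from the start of long ones. The pure-Python sketch below imitates that behaviour for a single sequence; `pad_post_truncate_pre` is a hypothetical helper, not a Keras function.

```python
def pad_post_truncate_pre(sequence, maxlen, value=0):
    """Pad at the end with `value`; if too long, keep only the last `maxlen` items."""
    if len(sequence) >= maxlen:
        return list(sequence[-maxlen:])
    return list(sequence) + [value] * (maxlen - len(sequence))

short_essay = [7, 3, 9]            # hypothetical token ids
long_essay = [1, 2, 3, 4, 5, 6]

print(pad_post_truncate_pre(short_essay, 5))
print(pad_post_truncate_pre(long_essay, 4))
```

For essays longer than 400 tokens this means the opening words are the ones discarded, which may or may not be what is wanted.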

In [0]:
#https://stackoverflow.com/posts/51956230/revisions
t = Tokenizer()
t.fit_on_texts(x_train.cleaned_essay)
vocab_size = len(t.word_index) + 1
# integer encode the documents
encoded_docs = t.texts_to_sequences(x_train.cleaned_essay)
essay_padded_train = padded(encoded_docs)
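
The tokenizer is fitted on the training essays only and then reused for the validation and test essays. A simplified pure-Python sketch of that idea follows; `fit_word_index` and `texts_to_sequences` here are hypothetical helpers that skip the punctuation filtering the Keras `Tokenizer` performs, and the texts are made up.

```python
from collections import Counter

def fit_word_index(texts):
    """Map each word to a rank by frequency; index 0 stays free for padding."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_sequences(texts, word_index):
    """Encode texts with a fixed index; words never seen during fitting are dropped."""
    return [[word_index[w] for w in text.lower().split() if w in word_index]
            for text in texts]

train_texts = ["students need books", "students need tablets"]
word_index = fit_word_index(train_texts)
vocab_size = len(word_index) + 1  # +1 for the padding index 0

unseen = texts_to_sequences(["students need microscopes"], word_index)
print(word_index)
print(unseen)
```

Without an out-of-vocabulary token, a word that appears only in validation or test essays contributes nothing to the encoded sequence.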

In [0]:

# Reuse the tokenizer fitted on the training essays (no refit on validation data),
# then integer encode the documents
encoded_docs = t.texts_to_sequences(x_cross.cleaned_essay)
essay_padded_cv = padded(encoded_docs)

In [0]:

# Reuse the tokenizer fitted on the training essays (no refit on test data),
# then integer encode the documents
encoded_docs = t.texts_to_sequences(x_test.cleaned_essay)
essay_padded_test = padded(encoded_docs)

In [0]:

print("encoded train data shape",essay_padded_train.shape)
print("encoded cv data shape",essay_padded_cv.shape)
print("encoded test data shape",essay_padded_test.shape)

encoded train data shape (69916, 400)
encoded cv data shape (17480, 400)
encoded test data shape (21849, 400)


In [0]:
from numpy import asarray
from tensorflow.keras.layers import Conv1D,MaxPooling1D

In [0]:
embeddings_index = dict()
# Each line of the GloVe file is a word followed by its 300 vector components.
with open('/content/drive/My Drive/Colab Notebooks/lstmDonorChoose/glove.42B.300d.txt', 'r', encoding="utf8") as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

In [0]:

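from numpy import zeros

# Row i holds the GloVe vector of the word with index i; rows stay zero for words missing from GloVe.
embedding_matrix = zeros((vocab_size, 300))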
for word, i in t.word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

In [52]:
print("embedding matrix shape",embedding_matrix.shape)

embedding matrix shape (47397, 300)
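
The construction of the embedding matrix can be illustrated on a toy vocabulary. In this sketch `toy_vectors` stands in for the GloVe lookup and `toy_word_index` for the tokenizer index; both are invented for the example, as is the coverage figure computed at the end.

```python
import numpy as np

# Hypothetical 3-dimensional "pretrained" vectors standing in for GloVe.
toy_vectors = {
    "students": np.array([0.4, 0.5, 0.6]),
    "books": np.array([0.1, 0.2, 0.3]),
}
# Hypothetical tokenizer index; index 0 is reserved for padding.
toy_word_index = {"students": 1, "books": 2, "zzzunknown": 3}

dim = 3
matrix = np.zeros((len(toy_word_index) + 1, dim))
for word, i in toy_word_index.items():
    vector = toy_vectors.get(word)
    if vector is not None:
        matrix[i] = vector  # words without a pretrained vector keep an all-zero row

# Fraction of the vocabulary that received a pretrained vector.
coverage = sum(word in toy_vectors for word in toy_word_index) / len(toy_word_index)
print(matrix.shape, round(coverage, 2))
```

Checking the coverage on the real vocabulary is a quick way to see how many of its rows start out as zeros.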


In [0]:
from tensorflow.keras.layers import Reshape,Concatenate, Dropout

In [68]:
text_input = Input(shape=(400,), name = "text_input")
# each essay is padded/truncated to max_length = 400 tokens

e1 = Embedding(vocab_size, 300, weights=[embedding_matrix], input_length=400)(text_input)
l1 = LSTM(128, activation="relu", dropout=0.5, kernel_regularizer=l2(0.001),
          kernel_initializer='glorot_normal', return_sequences=True)(e1)
l1= Dropout(0.6)(l1)
f1= Flatten()(l1)
f1= Dropout(0.6)(f1)
rem = Input(shape=(x_tr_rem.shape[1],1), name="rem")
rem_conv1 = Conv1D(128, 3,kernel_initializer='glorot_normal')(rem)
rem_conv1= Dropout(0.6)(rem_conv1)
max_pool =MaxPooling1D(3)(rem_conv1)
f2= Flatten()(max_pool)
f2= Dropout(0.6)(f2)
x = Concatenate()([f1,f2])
x= Dense(32,kernel_regularizer=l2(0.001),kernel_initializer='glorot_normal')(x)
x= Dense(16, activation='relu')(x)
output=Dense(2, activation='softmax')(x)
model_3 = Model(inputs=[text_input,rem], outputs=output)
model_3.summary()

Model: "model_2"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
text_input (InputLayer)         [(None, 400)]        0                                            
__________________________________________________________________________________________________
rem (InputLayer)                [(None, 102, 1)]     0                                            
__________________________________________________________________________________________________
embedding_4 (Embedding)         (None, 400, 300)     14219100    text_input[0][0]                 
__________________________________________________________________________________________________
conv1d_4 (Conv1D)               (None, 100, 128)     512         rem[0][0]                        
____________________________________________________________________________________________
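
The shapes and parameter counts visible in the (truncated) summary above can be checked by hand. The helpers below are hypothetical names for the standard formulas, assuming `'valid'` padding and a stride of 1 for the convolution.

```python
def conv1d_output_length(input_length, kernel_size, stride=1):
    """Output steps of a 1D convolution with 'valid' padding."""
    return (input_length - kernel_size) // stride + 1

def conv1d_params(in_channels, filters, kernel_size):
    """One kernel of size kernel_size x in_channels per filter, plus one bias per filter."""
    return kernel_size * in_channels * filters + filters

def embedding_params(vocab_size, dim):
    """One dim-sized vector per vocabulary entry."""
    return vocab_size * dim

print(conv1d_output_length(102, 3))   # steps after Conv1D(128, 3) on 102 features
print(conv1d_params(1, 128, 3))       # parameters of Conv1D(128, 3) on 1 channel
print(embedding_params(47397, 300))   # parameters of the embedding layer
```

These give 100 steps, 512 parameters and 14,219,100 parameters, matching the `conv1d_4` and `embedding_4` rows of the summary.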

In [0]:
#https://stackoverflow.com/posts/51734992/revisions
import tensorflow as tf
from sklearn.metrics import roc_auc_score

def auroc(y_true, y_pred):
    return tf.py_function(roc_auc_score, (y_true, y_pred), tf.double)
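
A metric wrapped this way is evaluated per batch, and Keras appears to report the running mean of those per-batch values, which need not equal the AUC over the full data set. The pure-Python sketch below illustrates the gap on made-up labels and scores; `pairwise_auc` is a hypothetical helper that computes AUC as the fraction of correctly ordered positive/negative pairs.

```python
def pairwise_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC is undefined when only one class is present")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical labels and predicted scores, split into two "batches" of four.
labels = [1, 1, 0, 1, 0, 1, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.85, 0.2]

full_auc = pairwise_auc(labels, scores)
batch_mean_auc = (pairwise_auc(labels[:4], scores[:4])
                  + pairwise_auc(labels[4:], scores[4:])) / 2

print(full_auc, batch_mean_auc)
```

A batch that happens to contain a single class makes the per-batch AUC undefined, so a large batch size helps here. Calling `roc_auc_score` once on the full set of predictions is the more reliable way to report a final figure.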

In [0]:

adam = tf.keras.optimizers.Adam(learning_rate=0.001)
model_3.compile(optimizer=adam, loss='categorical_crossentropy',metrics=[auroc])

In [0]:
batch_size=512

In [0]:
import os

logdir = './lstm_callbacks'

# Create the checkpoint directory if it does not exist yet.
os.makedirs(logdir, exist_ok=True)
output_model_file = os.path.join(logdir,
                                 "lstm_model_3.h5")

In [0]:
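from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping
callbacks = [
    # Keep only the weights with the lowest validation loss.
    ModelCheckpoint(output_model_file, monitor='val_loss',
                                    save_best_only = True),
    # Stop once val_loss has not improved by at least 1e-3 for 5 epochs.
    EarlyStopping(monitor='val_loss', patience=5, min_delta=1e-3),
]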

In [0]:
y_binary_train = to_categorical(y_train)
y_binary_cv = to_categorical(y_cv)
y_binary_test = to_categorical(y_test)
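
Because the model ends in a two-unit softmax trained with `categorical_crossentropy`, the 0/1 labels have to be expanded into two columns. For integer labels, the same encoding can be sketched with NumPy alone; the label array is made up.

```python
import numpy as np

y = np.array([0, 1, 1, 0])  # hypothetical binary labels

# Row k of the identity matrix is the one-hot vector for class k.
one_hot = np.eye(2)[y]

# argmax along the class axis recovers the original labels.
recovered = one_hot.argmax(axis=1)

print(one_hot)
```

Column 1 of the encoded array is the indicator for the positive class, so the second column of the softmax output is its predicted probability.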

In [70]:
history_3= model_3.fit({'text_input': essay_padded_train, 'rem':x_tr_rem_reshape},y_binary_train,
          epochs=20, batch_size=batch_size,verbose=1, validation_data=({'text_input': essay_padded_cv, 'rem': x_cv_rem_reshape},y_binary_cv),callbacks=callbacks)

Train on 69916 samples, validate on 17480 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
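
The log ends at epoch 12 of the 20 requested, which is consistent with the `EarlyStopping` callback ending the run. A simplified pure-Python sketch of that stopping rule follows; `early_stopping_epoch` is a hypothetical helper and the validation losses are invented, not taken from this run.

```python
def early_stopping_epoch(val_losses, patience=5, min_delta=1e-3):
    """Return the 1-based epoch at which training would stop, or None if it never does."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            # Improvement by more than min_delta: remember it and reset the counter.
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

# Hypothetical validation losses; the third value improves by less than min_delta.
stopped_at = early_stopping_epoch([0.60, 0.50, 0.4995, 0.51, 0.52], patience=2)
never_stops = early_stopping_epoch([0.5, 0.4, 0.3], patience=2)

print(stopped_at, never_stops)
```

With `save_best_only=True` on the checkpoint, the file on disk holds the weights from the best epoch rather than the last one, while the in-memory model keeps the weights from the final epoch.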


In [71]:
# evaluate returns [loss, auroc]; checkpointing and early-stopping callbacks are not needed here
result = model_3.evaluate({'text_input': essay_padded_test, 'rem':x_test_rem_reshape},
          y_binary_test,batch_size=batch_size)



Model 3 AUC : 72.87 with loss = 0.4681