# DonorsChoose

<p>
DonorsChoose.org receives hundreds of thousands of project proposals each year for classroom projects in need of funding. Right now, a large number of volunteers is needed to manually screen each submission before it's approved to be posted on the DonorsChoose.org website.
</p>
<p>
    Next year, DonorsChoose.org expects to receive close to 500,000 project proposals. As a result, there are three main problems they need to solve:
<ul>
<li>
    How to scale current manual processes and resources to screen 500,000 projects so that they can be posted as quickly and as efficiently as possible</li>
    <li>How to increase the consistency of project vetting across different volunteers to improve the experience for teachers</li>
    <li>How to focus volunteer time on the applications that need the most assistance</li>
    </ul>
</p>    
<p>
The goal of the competition is to predict whether or not a DonorsChoose.org project proposal submitted by a teacher will be approved, using the text of project descriptions as well as additional metadata about the project, teacher, and school. DonorsChoose.org can then use this information to identify projects most likely to need further review before approval.
</p>

## About the DonorsChoose Data Set

The `train.csv` data set provided by DonorsChoose contains the following features:

Feature | Description 
----------|---------------
**`project_id`** | A unique identifier for the proposed project. **Example:** `p036502`   
**`project_title`**    | Title of the project. **Examples:**<br><ul><li><code>Art Will Make You Happy!</code></li><li><code>First Grade Fun</code></li></ul> 
**`project_grade_category`** | Grade level of students for which the project is targeted. One of the following enumerated values: <br/><ul><li><code>Grades PreK-2</code></li><li><code>Grades 3-5</code></li><li><code>Grades 6-8</code></li><li><code>Grades 9-12</code></li></ul>  
**`project_subject_categories`** | One or more (comma-separated) subject categories for the project from the following enumerated list of values:  <br/><ul><li><code>Applied Learning</code></li><li><code>Care &amp; Hunger</code></li><li><code>Health &amp; Sports</code></li><li><code>History &amp; Civics</code></li><li><code>Literacy &amp; Language</code></li><li><code>Math &amp; Science</code></li><li><code>Music &amp; The Arts</code></li><li><code>Special Needs</code></li><li><code>Warmth</code></li></ul><br/> **Examples:** <br/><ul><li><code>Music &amp; The Arts</code></li><li><code>Literacy &amp; Language, Math &amp; Science</code></li></ul>  
  **`school_state`** | State where school is located ([Two-letter U.S. postal code](https://en.wikipedia.org/wiki/List_of_U.S._state_abbreviations#Postal_codes)). **Example:** `WY`
**`project_subject_subcategories`** | One or more (comma-separated) subject subcategories for the project. **Examples:** <br/><ul><li><code>Literacy</code></li><li><code>Literature &amp; Writing, Social Sciences</code></li></ul> 
**`project_resource_summary`** | An explanation of the resources needed for the project. **Example:** <br/><ul><li><code>My students need hands on literacy materials to manage sensory needs!</code></li></ul> 
**`project_essay_1`**    | First application essay<sup>*</sup>  
**`project_essay_2`**    | Second application essay<sup>*</sup> 
**`project_essay_3`**    | Third application essay<sup>*</sup> 
**`project_essay_4`**    | Fourth application essay<sup>*</sup> 
**`project_submitted_datetime`** | Datetime when project application was submitted. **Example:** `2016-04-28 12:43:56.245`   
**`teacher_id`** | A unique identifier for the teacher of the proposed project. **Example:** `bdf8baa8fedef6bfeec7ae4ff1c15c56`  
**`teacher_prefix`** | Teacher's title. One of the following enumerated values: <br/><ul><li><code>nan</code></li><li><code>Dr.</code></li><li><code>Mr.</code></li><li><code>Mrs.</code></li><li><code>Ms.</code></li><li><code>Teacher.</code></li></ul>  
**`teacher_number_of_previously_posted_projects`** | Number of project applications previously submitted by the same teacher. **Example:** `2` 

<sup>*</sup> See the section <b>Notes on the Essay Data</b> for more details about these features.

Additionally, the `resources.csv` data set provides more data about the resources required for each project. Each line in this file represents a resource required by a project:

Feature | Description 
----------|---------------
**`id`** | A `project_id` value from the `train.csv` file.  **Example:** `p036502`   
**`description`** | Description of the resource. **Example:** `Tenor Saxophone Reeds, Box of 25`   
**`quantity`** | Quantity of the resource required. **Example:** `3`   
**`price`** | Price of the resource required. **Example:** `9.95`   

**Note:** Many projects require multiple resources. The `id` value corresponds to a `project_id` in `train.csv`, so you can use it as a key to retrieve all resources needed for a project.
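
A minimal sketch of that lookup, assuming pandas is available. The rows below are made up for illustration (only the saxophone-reeds description comes from the table above); `project_resources` and `totals` are hypothetical names.

```python
import pandas as pd

# Toy stand-in for resources.csv; the rows are illustrative, not real data.
resources = pd.DataFrame({
    "id": ["p036502", "p036502", "p000001"],
    "description": ["Tenor Saxophone Reeds, Box of 25", "Music Stand", "Pencils"],
    "quantity": [3, 1, 10],
    "price": [9.95, 20.00, 1.50],
})

# All resources requested by one project, selected through the shared key.
project_resources = resources[resources["id"] == "p036502"]

# Per-project sums of the price and quantity columns.
totals = resources.groupby("id")[["price", "quantity"]].sum()
```

Filtering answers "what does this project need?", while the grouped sums give one row per project that can later be merged back onto the project table.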

The data set contains the following label (the value you will attempt to predict):

Label | Description
----------|---------------
`project_is_approved` | A binary flag indicating whether DonorsChoose approved the project. A value of `0` indicates the project was not approved, and a value of `1` indicates the project was approved.

### Notes on the Essay Data

<ul>
Prior to May 17, 2016, the prompts for the essays were as follows:
<li>__project_essay_1:__ "Introduce us to your classroom"</li>
<li>__project_essay_2:__ "Tell us more about your students"</li>
<li>__project_essay_3:__ "Describe how your students will use the materials you're requesting"</li>
<li>__project_essay_4:__ "Close by sharing why your project will make a difference"</li>
</ul>


<ul>
Starting on May 17, 2016, the number of essays was reduced from 4 to 2, and the prompts for the first 2 essays were changed to the following:<br>
<li>__project_essay_1:__ "Describe your students: What makes your students special? Specific details about their background, your neighborhood, and your school are all helpful."</li>
<li>__project_essay_2:__ "About your project: How will these materials make a difference in your students' learning and improve their school lives?"</li>
<br>For all projects with project_submitted_datetime of 2016-05-17 and later, the values of project_essay_3 and project_essay_4 will be NaN.
</ul>
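
Because essays 3 and 4 are missing for newer projects, any code that joins the essay fields has to decide what to do with missing values. The sketch below is one possible approach in plain Python, not the notebook's own method; `combine_essays` is a hypothetical helper that skips `None` and float `NaN` entries.

```python
import math

def combine_essays(essays):
    """Join the essay fields of one project, skipping missing values."""
    parts = []
    for essay in essays:
        # Missing essays may arrive as None or as a float NaN (as pandas reads them).
        if essay is None or (isinstance(essay, float) and math.isnan(essay)):
            continue
        parts.append(essay.strip())
    return " ".join(parts)

# A project submitted after 2016-05-17: essays 3 and 4 are missing.
new_style = combine_essays(
    ["My students are curious.", "These books will help.", float("nan"), None]
)
```

Skipping the missing fields avoids ending up with the literal text "nan" inside the combined essay.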


In [None]:
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

import sqlite3
import pandas as pd
import numpy as np
import nltk
import string
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.metrics import roc_curve, auc
from nltk.stem.porter import PorterStemmer

import re
# Tutorial about Python regular expressions: https://pymotw.com/2/re/
import string
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer

from gensim.models import Word2Vec
from gensim.models import KeyedVectors
import pickle

from tqdm import tqdm
import os

from plotly import plotly
import plotly.offline as offline
import plotly.graph_objs as go
offline.init_notebook_mode()
from collections import Counter

## 1.1 Reading Data

In [None]:
# If using Colab
# from google.colab import drive
# drive.mount('/content/drive/')

# project_data = pd.read_csv('/content/drive/My Drive/IPYNB/train_data.csv')
# resource_data = pd.read_csv('/content/drive/My Drive/IPYNB/resources.csv')

In [None]:
!ls "../input/knn-aaic-data"

In [None]:

project_data = pd.read_csv('../input/knn-aaic-data/train_data.csv')
resource_data = pd.read_csv('../input/knn-aaic-data/resources.csv')

In [None]:
print("Number of data points in train data", project_data.shape)
print('-'*50)
print("The attributes of data :", project_data.columns.values)
project_data.replace(".nannan", "")

In [None]:
# how to replace elements in list python: https://stackoverflow.com/a/2582163/4084039
cols = ['Date' if x=='project_submitted_datetime' else x for x in list(project_data.columns)]


#sort dataframe based on time pandas python: https://stackoverflow.com/a/49702492/4084039
project_data['Date'] = pd.to_datetime(project_data['project_submitted_datetime'])
project_data.drop('project_submitted_datetime', axis=1, inplace=True)
project_data.sort_values(by=['Date'], inplace=True)


# how to reorder columns pandas python: https://stackoverflow.com/a/13148611/4084039
project_data = project_data[cols]


project_data.head(2)

In [None]:
print("Number of data points in train data", resource_data.shape)
print(resource_data.columns.values)
resource_data.head(2)

## 1.2 preprocessing of `project_subject_categories`

In [None]:
categories = list(project_data['project_subject_categories'].values)
# remove special characters from list of strings python: https://stackoverflow.com/a/47301924/4084039

# https://www.geeksforgeeks.org/removing-stop-words-nltk-python/
# https://stackoverflow.com/questions/23669024/how-to-strip-a-specific-word-from-a-string
# https://stackoverflow.com/questions/8270092/remove-all-whitespace-in-a-string-in-python
cat_list = []
for i in categories:
    temp = ""
    # consider text like "Math & Science, Warmth, Care & Hunger"
    for j in i.split(','): # splits it into three parts: ["Math & Science", " Warmth", " Care & Hunger"]
        if 'The' in j.split(): # split the category on whitespace: "Music & The Arts" => "Music", "&", "The", "Arts"
            j = j.replace('The','') # remove the word 'The'
        j = j.replace(' ','') # remove all spaces, e.g. "Math & Science" => "Math&Science"
        temp += j.strip()+" " # separate the cleaned categories with a single space
        temp = temp.replace('&','_') # replace '&' with '_', e.g. "Math&Science" => "Math_Science"
    cat_list.append(temp.strip())
    
project_data['clean_categories'] = cat_list
project_data.drop(['project_subject_categories'], axis=1, inplace=True)

from collections import Counter  # used to count how often each category occurs
my_counter = Counter()
for word in project_data['clean_categories'].values:
    my_counter.update(word.split())
cat_dict = dict(my_counter)
sorted_cat_dict = dict(sorted(cat_dict.items(), key=lambda kv: kv[1]))

In [None]:
sorted_cat_dict
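
The cleaning and counting steps above can be sketched as a small standalone function. This is a simplified variant written for illustration (it drops the standalone word `The` rather than replacing the substring), and `clean_category` is a hypothetical name.

```python
from collections import Counter

def clean_category(raw):
    """Turn 'Math & Science, Music & The Arts' into 'Math_Science Music_Arts'."""
    tokens = []
    for part in raw.split(","):
        # Drop the standalone word 'The', then glue the remaining words together.
        words = [w for w in part.split() if w != "The"]
        tokens.append("".join(words).replace("&", "_"))
    return " ".join(tokens)

cleaned = [
    clean_category(c)
    for c in ["Math & Science, Warmth, Care & Hunger", "Music & The Arts"]
]

# Count how many projects mention each cleaned category token.
counts = Counter(token for row in cleaned for token in row.split())
```

Each cleaned category becomes a single whitespace-free token, which is what lets a whitespace-based vectorizer treat it as one feature later on.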

## 1.3 preprocessing of `project_subject_subcategories`

In [None]:
sub_categories = list(project_data['project_subject_subcategories'].values)
# remove special characters from list of strings python: https://stackoverflow.com/a/47301924/4084039

# https://www.geeksforgeeks.org/removing-stop-words-nltk-python/
# https://stackoverflow.com/questions/23669024/how-to-strip-a-specific-word-from-a-string
# https://stackoverflow.com/questions/8270092/remove-all-whitespace-in-a-string-in-python

sub_cat_list = []
for i in sub_categories:
    temp = ""
    # consider text like "Literature & Writing, Social Sciences"
    for j in i.split(','): # splits it into two parts: ["Literature & Writing", " Social Sciences"]
        if 'The' in j.split(): # split the subcategory on whitespace and check for the word 'The'
            j = j.replace('The','') # remove the word 'The'
        j = j.replace(' ','') # remove all spaces, e.g. "Literature & Writing" => "Literature&Writing"
        temp += j.strip()+" " # separate the cleaned subcategories with a single space
        temp = temp.replace('&','_') # replace '&' with '_', e.g. "Literature&Writing" => "Literature_Writing"
    sub_cat_list.append(temp.strip())

project_data['clean_subcategories'] = sub_cat_list
project_data.drop(['project_subject_subcategories'], axis=1, inplace=True)

# count of all the words in corpus python: https://stackoverflow.com/a/22898595/4084039
my_counter = Counter()
for word in project_data['clean_subcategories'].values:
    my_counter.update(word.split())
    
sub_cat_dict = dict(my_counter)
sorted_sub_cat_dict = dict(sorted(sub_cat_dict.items(), key=lambda kv: kv[1]))

## 1.3 Text preprocessing

In [None]:
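# merge the four essay columns into a single text column;
# missing essays (NaN) are treated as empty strings so they do not turn into the literal text "nan"
project_data["essay"] = (project_data["project_essay_1"].fillna('') + ' ' +
                         project_data["project_essay_2"].fillna('') + ' ' +
                         project_data["project_essay_3"].fillna('') + ' ' +
                         project_data["project_essay_4"].fillna('')).str.strip()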
len(project_data["essay"])

In [None]:
project_data.head(2)

In [None]:
#### 1.4.2.3 Using Pretrained Models: TFIDF weighted W2V
with open('../input/glove-vectors/glove_vectors', 'rb') as f:
    model = pickle.load(f)
    glove_words =  set(model.keys())

tfidf_model = TfidfVectorizer(max_features= 5000)
tfidf_model.fit(project_data["essay"][:50000])
# we are converting a dictionary with word as a key, and the idf as a value
dictionary = dict(zip(tfidf_model.get_feature_names(), list(tfidf_model.idf_)))
tfidf_words = set(tfidf_model.get_feature_names())

tfidf_w2v_vectors = [] # the tfidf-weighted w2v vector of each essay is stored in this list
for sentence in tqdm(project_data["essay"][:50000]): # for each essay
    vector = np.zeros(300) # start from a zero vector; the GloVe vectors are 300-dimensional
    tf_idf_weight = 0 # running sum of the tfidf weights of the words that have a vector
    words = sentence.split()
    word_counts = Counter(words) # whole-word counts; str.count would also match substrings
    for word in words: # for each word in the essay
        if (word in glove_words) and (word in tfidf_words):
            vec = model[word] # getting the vector for each word
            # tfidf = idf value (dictionary[word]) * tf value (word count / number of words in the essay)
            tf_idf = dictionary[word]*(word_counts[word]/len(words))
            vector += (vec * tf_idf) # accumulate the tfidf-weighted vector
            tf_idf_weight += tf_idf
    if tf_idf_weight != 0:
        vector /= tf_idf_weight
    tfidf_w2v_vectors.append(vector)

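# remove any trailing "nan"/"nannan" text produced by missing essays;
# Series.replace would only match whole cell values, so use a regex on the strings instead
project_data['essay'] = project_data['essay'].str.replace(r'(?:nan)+$', '', regex=True)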

print(len(tfidf_w2v_vectors))
print(len(tfidf_w2v_vectors[0]))

In [None]:
# printing a few sample essays
print(project_data['essay'].values[0])
print("="*50)
print(project_data['essay'].values[150])
print("="*50)
print(project_data['essay'].values[1000])
print("="*50)
print(project_data['essay'].values[20000])
print("="*50)
print(project_data['essay'].values[99999])
print("="*50)

In [None]:
# https://stackoverflow.com/a/47091490/4084039
import re

def decontracted(phrase):
    # specific
    phrase = re.sub(r"won't", "will not", phrase)
    phrase = re.sub(r"can\'t", "can not", phrase)

    # general
    phrase = re.sub(r"n\'t", " not", phrase)
    phrase = re.sub(r"\'re", " are", phrase)
    phrase = re.sub(r"\'s", " is", phrase)
    phrase = re.sub(r"\'d", " would", phrase)
    phrase = re.sub(r"\'ll", " will", phrase)
    phrase = re.sub(r"\'t", " not", phrase)
    phrase = re.sub(r"\'ve", " have", phrase)
    phrase = re.sub(r"\'m", " am", phrase)
    return phrase

In [None]:
sent = decontracted(project_data['essay'].values[20000])
print(sent)
print("="*50)

In [None]:
# \r \n \t remove from string python: http://texthandler.com/info/remove-line-breaks-python/
sent = sent.replace('\\r', ' ')
sent = sent.replace('\\"', ' ')
sent = sent.replace('\\n', ' ')
print(sent)

In [None]:
# remove special characters: https://stackoverflow.com/a/5843547/4084039
sent = re.sub('[^A-Za-z0-9]+', ' ', sent)

print(sent)

In [None]:
# https://gist.github.com/sebleier/554280
# we are removing the words from the stop words list: 'no', 'nor', 'not'
stopwords= ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've",\
            "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', \
            'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their',\
            'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', \
            'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', \
            'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', \
            'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after',\
            'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further',\
            'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more',\
            'most', 'other', 'some', 'such', 'only', 'own', 'same', 'so', 'than', 'too', 'very', \
            's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', \
            've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn',\
            "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn',\
            "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", \
            'won', "won't", 'wouldn', "wouldn't"]

In [None]:
# Combining all the above statements
from tqdm import tqdm
preprocessed_essays = []
# tqdm is for printing the status bar
for sentence in tqdm(project_data['essay'][:50000].values):
    sent = decontracted(sentence)
    sent = sent.replace('\\r', ' ')
    sent = sent.replace('\\"', ' ')
    sent = sent.replace('\\n', ' ')
    sent = re.sub('[^A-Za-z0-9]+', ' ', sent)
    # https://gist.github.com/sebleier/554280
    sent = ' '.join(e for e in sent.split() if e.lower() not in stopwords)
    preprocessed_essays.append(sent.lower().strip())

In [None]:
# after preprocessing
preprocessed_essays[:200]
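
The preprocessing steps (expanding contractions, removing escaped line breaks, stripping special characters, dropping stop words, lowercasing) can be collected in one function. The sketch below is a compact illustration under stated assumptions: `STOPWORDS` is a small hypothetical subset of the full list, and `decontract` and `preprocess` are illustrative names rather than the notebook's functions.

```python
import re

# Hypothetical, shortened stop-word list used only for this illustration.
STOPWORDS = {"i", "my", "the", "to", "a", "is", "are", "they", "and"}

# Specific contractions first, then the general suffix rules.
CONTRACTIONS = [
    (r"won't", "will not"), (r"can't", "can not"),
    (r"n't", " not"), (r"'re", " are"), (r"'s", " is"),
    (r"'d", " would"), (r"'ll", " will"), (r"'ve", " have"), (r"'m", " am"),
]

def decontract(phrase):
    for pattern, replacement in CONTRACTIONS:
        phrase = re.sub(pattern, replacement, phrase)
    return phrase

def preprocess(text, stopwords=STOPWORDS):
    text = decontract(text)
    # The raw essays contain literal backslash sequences such as \r and \n.
    for escaped in ("\\r", "\\n", '\\"'):
        text = text.replace(escaped, " ")
    text = re.sub("[^A-Za-z0-9]+", " ", text)
    return " ".join(w for w in text.lower().split() if w not in stopwords)

cleaned = preprocess("My students can't wait!\\r\\nThey're eager to learn.")
```

Here `cleaned` becomes `students can not wait eager learn`. Note that the negation survives only because "not" is absent from the stop-word list.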


## 1.4 Preprocessing of `project_title`

In [None]:
# preprocess the titles in the same way as the essays
preprocessed_title = []
for sentence in tqdm(project_data['project_title'][:50000]):
    s = sentence.replace("\\r", "")
    s = s.replace("\\n", "")
    s = decontracted(s) # each step works on the result of the previous one
    s = re.sub('[^A-Za-z0-9]+', ' ', s)
    preprocessed_title.append(s.strip())
preprocessed_title

## 1.5 Preparing data for models

In [None]:
project_data.columns

we are going to consider

       - school_state : categorical data
       - clean_categories : categorical data
       - clean_subcategories : categorical data
       - project_grade_category : categorical data
       - teacher_prefix : categorical data
       
       - project_title : text data
       - text : text data
       - project_resource_summary: text data (optional)
       
       - quantity : numerical (optional)
       - teacher_number_of_previously_posted_projects : numerical
       - price : numerical

### 1.5.1 Vectorizing Categorical data

- https://www.appliedaicourse.com/course/applied-ai-course-online/lessons/handling-categorical-and-numerical-features/

In [None]:
# we use CountVectorizer with binary=True to one-hot encode the categorical values
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(vocabulary=list(sorted_cat_dict.keys()), lowercase=False, binary=True)
categories_one_hot = vectorizer.fit_transform(project_data['clean_categories'][:50000].values)
print(vectorizer.get_feature_names())
print("Shape of matrix after one hot encoding ",categories_one_hot.shape)

In [None]:
# we use CountVectorizer with binary=True to one-hot encode the subcategories
vectorizer = CountVectorizer(vocabulary=list(sorted_sub_cat_dict.keys()), lowercase=False, binary=True)
sub_categories_one_hot = vectorizer.fit_transform(project_data['clean_subcategories'][:50000].values)
print(vectorizer.get_feature_names())
print("Shape of matrix after one hot encoding ",sub_categories_one_hot.shape)
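
As a rough illustration of what a binary encoding over a fixed vocabulary produces, here is a plain-Python sketch. It is a simplification: it tokenizes on whitespace, whereas `CountVectorizer` uses a regular-expression token pattern, and `one_hot` is a hypothetical helper.

```python
def one_hot(values, vocabulary):
    """Binary indicator rows: 1 where a token of the value is in the vocabulary."""
    index = {term: i for i, term in enumerate(vocabulary)}
    rows = []
    for value in values:
        row = [0] * len(vocabulary)
        for token in value.split():
            if token in index:
                row[index[token]] = 1
        rows.append(row)
    return rows

vocab = ["Warmth", "Care_Hunger", "Math_Science"]
encoded = one_hot(["Math_Science Warmth", "Care_Hunger", ""], vocab)
```

A project with two categories gets two ones in its row, and a value with no known token yields an all-zero row, so strictly speaking this is a multi-hot encoding.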

In [None]:
# one-hot encode school_state, teacher_prefix and project_grade_category in the same way
# token_pattern=r'.+' treats each value as a single token, so values such as 'Mrs.' or
# 'Grades PreK-2' are matched against the vocabulary as whole strings
states = project_data['school_state'][:50000]
vectorizer = CountVectorizer(vocabulary=sorted(set(states)), lowercase=False, binary=True, token_pattern=r'.+')
school_state_one_hot = vectorizer.fit_transform(states.values)
print(vectorizer.get_feature_names())
print("Shape of matrix after one hot encoding ",school_state_one_hot.shape)

# missing prefixes are filled with '' so that they produce an all-zero row instead of an error
prefixes = project_data['teacher_prefix'][:50000].fillna('')
vectorizer = CountVectorizer(vocabulary=sorted(set(prefixes) - {''}), lowercase=False, binary=True, token_pattern=r'.+')
teacher_prefix_one_hot = vectorizer.fit_transform(prefixes.values)
print(vectorizer.get_feature_names())
print("Shape of matrix after one hot encoding ",teacher_prefix_one_hot.shape)

grades = project_data['project_grade_category'][:50000]
vectorizer = CountVectorizer(vocabulary=sorted(set(grades)), lowercase=False, binary=True, token_pattern=r'.+')
project_grade_category_one_hot = vectorizer.fit_transform(grades.values)
print(vectorizer.get_feature_names())
print("Shape of matrix after one hot encoding ",project_grade_category_one_hot.shape)

print(project_grade_category_one_hot)



### 1.5.2 Vectorizing Text data

#### 1.5.2.1 Bag of words

In [None]:
# We consider only the words that appear in at least 10 documents (rows or projects).
vectorizer = CountVectorizer(min_df=10)
essay_bow = vectorizer.fit_transform(preprocessed_essays)
print("Shape of matrix after bag-of-words vectorization ",essay_bow.shape)

In [None]:
# the titles can be vectorized in the same way
# make sure the titles are preprocessed before vectorizing them
vectorizer = CountVectorizer(min_df=10)
title_bow = vectorizer.fit_transform(preprocessed_title)
print("Shape of matrix after bag-of-words vectorization ",title_bow.shape)
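
To make the effect of `min_df` concrete, the following plain-Python sketch builds a vocabulary from document frequencies and then counts terms. It illustrates the idea only and is not how scikit-learn is implemented; `build_vocabulary` and `bag_of_words` are hypothetical helpers and the three documents are made up.

```python
from collections import Counter

def build_vocabulary(documents, min_df):
    """Keep terms that occur in at least `min_df` documents."""
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc.split()))  # count each term once per document
    return sorted(term for term, df in doc_freq.items() if df >= min_df)

def bag_of_words(documents, vocabulary):
    """Raw term counts per document, one column per vocabulary term."""
    rows = []
    for doc in documents:
        counts = Counter(doc.split())
        rows.append([counts[term] for term in vocabulary])
    return rows

docs = ["students need books", "students love books books", "students need tablets"]
vocab = build_vocabulary(docs, min_df=2)
bow = bag_of_words(docs, vocab)
```

With `min_df=2`, the rare terms "love" and "tablets" are dropped and the matrix has three columns, which is why a higher `min_df` shrinks the feature space.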


#### 1.5.2.2 TFIDF vectorizer

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(min_df=10, max_features= 5000)
text_tfidf = vectorizer.fit_transform(preprocessed_essays)
print("Shape of matrix after TFIDF vectorization ",text_tfidf.shape)
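
For intuition about the weights, the sketch below computes the smoothed inverse document frequency that scikit-learn documents for its default settings, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation of each row. The document counts are invented for the example, and `smoothed_idf` and `tfidf_row` are hypothetical helpers.

```python
import math

def smoothed_idf(n_documents, document_frequency):
    """idf = ln((1 + n) / (1 + df)) + 1 (smoothed form)."""
    return math.log((1 + n_documents) / (1 + document_frequency)) + 1

def tfidf_row(term_counts, idf):
    """Multiply raw counts by idf, then L2-normalise the row."""
    weights = {t: c * idf[t] for t, c in term_counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else weights

# Assumed corpus of 3 documents: 'students' appears in all 3, 'books' in 2.
idf = {"students": smoothed_idf(3, 3), "books": smoothed_idf(3, 2)}
row = tfidf_row({"students": 1, "books": 2}, idf)
```

A term that appears in every document gets an idf of exactly 1, so rarer terms such as "books" end up with the larger share of the normalised row.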

#### 1.5.2.3 Using Pretrained Models: Avg W2V

In [None]:
'''
# Reading glove vectors in python: https://stackoverflow.com/a/38230349/4084039
def loadGloveModel(gloveFile):
    print ("Loading Glove Model")
    f = open(gloveFile,'r', encoding="utf8")
    model = {}
    for line in tqdm(f):
        splitLine = line.split()
        word = splitLine[0]
        embedding = np.array([float(val) for val in splitLine[1:]])
        model[word] = embedding
    print ("Done.",len(model)," words loaded!")
    return model
model = loadGloveModel('glove.42B.300d.txt')

# ============================
Output:
    
Loading Glove Model
1917495it [06:32, 4879.69it/s]
Done. 1917495  words loaded!

# ============================

words = []
for i in preproced_texts:
    words.extend(i.split(' '))

for i in preproced_titles:
    words.extend(i.split(' '))
print("all the words in the corpus", len(words))
words = set(words)
print("the unique words in the corpus", len(words))

inter_words = set(model.keys()).intersection(words)
print("The number of words that are present in both glove vectors and our corpus", \
      len(inter_words),"(",np.round(len(inter_words)/len(words)*100,3),"%)")

words_courpus = {}
words_glove = set(model.keys())
for i in words:
    if i in words_glove:
        words_courpus[i] = model[i]
print("word 2 vec length", len(words_courpus))


# storing variables in pickle files python: http://www.jessicayung.com/how-to-use-pickle-to-save-and-load-variables-in-python/

import pickle
with open('content/drive/My Drive/IPYNB/glove_vectors', 'wb') as f:
    pickle.dump(words_courpus, f)


'''

In [None]:
# storing variables in pickle files python: http://www.jessicayung.com/how-to-use-pickle-to-save-and-load-variables-in-python/
# make sure you have the glove_vectors file
with open('../input/glove-vectors/glove_vectors', 'rb') as f:
    model = pickle.load(f)
    glove_words =  set(model.keys())

In [None]:
# average Word2Vec
# compute the average word vector for each essay
avg_w2v_vectors = [] # the avg-w2v of each essay is stored in this list
for sentence in tqdm(preprocessed_essays): # for each essay
    vector = np.zeros(300) # start from a zero vector; the GloVe vectors are 300-dimensional
    cnt_words = 0 # number of words in the essay that have a GloVe vector
    for word in sentence.split(): # for each word in the essay
        if word in glove_words:
            vector += model[word]
            cnt_words += 1
    if cnt_words != 0:
        vector /= cnt_words
    avg_w2v_vectors.append(vector)

print(len(avg_w2v_vectors))
print(len(avg_w2v_vectors[0]))
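
The averaging step is easier to follow on a toy example. The sketch below assumes NumPy and uses made-up 3-dimensional embeddings in place of the 300-dimensional GloVe vectors; `toy_vectors` and `average_vector` are hypothetical names.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings, for illustration only.
toy_vectors = {
    "students": np.array([1.0, 0.0, 2.0]),
    "books": np.array([3.0, 2.0, 0.0]),
}

def average_vector(text, vectors, dim):
    """Mean of the vectors of all words that have an embedding."""
    total = np.zeros(dim)
    n_found = 0
    for word in text.split():
        if word in vectors:
            total += vectors[word]
            n_found += 1
    # Texts without any known word keep the zero vector.
    return total / n_found if n_found else total

avg = average_vector("students love books", toy_vectors, dim=3)
```

The word "love" has no embedding here, so it is skipped and the result is the mean of the two known vectors.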

#### 1.5.2.4 Using Pretrained Models: TFIDF weighted W2V

In [None]:
tfidf_model = TfidfVectorizer(max_features= 5000)
tfidf_model.fit(preprocessed_essays)
# we are converting a dictionary with word as a key, and the idf as a value
dictionary = dict(zip(tfidf_model.get_feature_names(), list(tfidf_model.idf_)))
tfidf_words = set(tfidf_model.get_feature_names())

In [None]:
# TFIDF weighted Word2Vec
# compute the tfidf-weighted average word vector for each essay
tfidf_w2v_vectors = [] # the tfidf-weighted w2v vector of each essay is stored in this list
for sentence in tqdm(preprocessed_essays): # for each essay
    vector = np.zeros(300) # start from a zero vector; the GloVe vectors are 300-dimensional
    tf_idf_weight = 0 # running sum of the tfidf weights of the words that have a vector
    words = sentence.split()
    word_counts = Counter(words) # whole-word counts; str.count would also match substrings
    for word in words: # for each word in the essay
        if (word in glove_words) and (word in tfidf_words):
            vec = model[word] # getting the vector for each word
            # tfidf = idf value (dictionary[word]) * tf value (word count / number of words in the essay)
            tf_idf = dictionary[word]*(word_counts[word]/len(words))
            vector += (vec * tf_idf) # accumulate the tfidf-weighted vector
            tf_idf_weight += tf_idf
    if tf_idf_weight != 0:
        vector /= tf_idf_weight
    tfidf_w2v_vectors.append(vector)

print(len(tfidf_w2v_vectors))
print(len(tfidf_w2v_vectors[0]))
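
The weighting can be checked by hand on a two-word example. This is a minimal sketch assuming NumPy, with made-up 2-dimensional vectors and idf values; `tfidf_weighted_vector` is a hypothetical helper that counts whole words with `Counter`.

```python
import numpy as np
from collections import Counter

def tfidf_weighted_vector(text, vectors, idf, dim):
    """Weighted mean of word vectors, each weighted by tf * idf."""
    words = text.split()
    counts = Counter(words)
    total = np.zeros(dim)
    weight_sum = 0.0
    for word in words:
        if word in vectors and word in idf:
            weight = idf[word] * counts[word] / len(words)  # tf * idf
            total += weight * vectors[word]
            weight_sum += weight
    return total / weight_sum if weight_sum else total

# Toy inputs: 'books' is rarer, so it carries a higher idf.
vectors = {"students": np.array([1.0, 0.0]), "books": np.array([0.0, 1.0])}
idf = {"students": 1.0, "books": 3.0}
weighted = tfidf_weighted_vector("students books", vectors, idf, dim=2)
```

The weights are 0.5 and 1.5, so the result is pulled three times as strongly towards the vector of the rarer word as towards the common one.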

In [None]:
# the titles can be vectorized in the same way

tfidf_model = TfidfVectorizer(max_features= 5000)
tfidf_model.fit(preprocessed_title)
# build a dictionary with the word as key and its idf as value
dictionary = dict(zip(tfidf_model.get_feature_names(), list(tfidf_model.idf_)))
tfidf_words = set(tfidf_model.get_feature_names())

# compute the tfidf-weighted average word vector for each title
tfidf_w2v_title = [] # the tfidf-weighted w2v vector of each title is stored in this list
for sentence in tqdm(preprocessed_title): # for each title
    vector = np.zeros(300) # start from a zero vector; the GloVe vectors are 300-dimensional
    tf_idf_weight = 0 # running sum of the tfidf weights of the words that have a vector
    words = sentence.split()
    word_counts = Counter(words) # whole-word counts; str.count would also match substrings
    for word in words: # for each word in the title
        if (word in glove_words) and (word in tfidf_words):
            vec = model[word] # getting the vector for each word
            # tfidf = idf value (dictionary[word]) * tf value (word count / number of words in the title)
            tf_idf = dictionary[word]*(word_counts[word]/len(words))
            vector += (vec * tf_idf) # accumulate the tfidf-weighted vector
            tf_idf_weight += tf_idf
    if tf_idf_weight != 0:
        vector /= tf_idf_weight
    tfidf_w2v_title.append(vector)

print(len(tfidf_w2v_title))
print(len(tfidf_w2v_title[0]))

### 1.5.3 Vectorizing Numerical features

In [None]:
price_data = resource_data.groupby('id').agg({'price':'sum', 'quantity':'sum'}).reset_index()
project_data = pd.merge(project_data, price_data, on='id', how='left')

In [None]:
# check this one: https://www.youtube.com/watch?v=0HOqOcln3Z4&t=530s
# standardization sklearn: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
from sklearn.preprocessing import StandardScaler

# price_standardized = standardScalar.fit(project_data['price'].values)
# this will raise the error
# ValueError: Expected 2D array, got 1D array instead: array=[725.05 213.03 329.   ... 399.   287.73   5.5 ].
# Reshape your data either using array.reshape(-1, 1)

price_scalar = StandardScaler()
price_scalar.fit(project_data['price'].values.reshape(-1,1)) # finding the mean and standard deviation of this data
print(f"Mean : {price_scalar.mean_[0]}, Standard deviation : {np.sqrt(price_scalar.var_[0])}")

# Now standardize the data with the above mean and standard deviation.
price_standardized = price_scalar.transform(project_data['price'][:50000].values.reshape(-1, 1))



# quantity
quantity_scalar = StandardScaler()
quantity_scalar.fit(project_data['quantity'].values.reshape(-1,1)) # finding the mean and standard deviation of this data

pre_quantity = quantity_scalar.transform(project_data['quantity'][:50000].values.reshape(-1, 1))

#teacher_number_of_previously_posted_projects

teacher_scalar = StandardScaler()
teacher_scalar.fit(project_data['teacher_number_of_previously_posted_projects'].values.reshape(-1,1)) # finding the mean and standard deviation of this data

pre_teacher_number_of_previously_posted_projects = teacher_scalar.transform(project_data['teacher_number_of_previously_posted_projects'][:50000].values.reshape(-1, 1))

In [None]:
price_standardized
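
Standardization itself is a two-step operation: estimate the mean and standard deviation, then rescale. The sketch below assumes NumPy and uses the population standard deviation (`ddof=0`), which is what `StandardScaler` computes; `fit_standardizer` and `standardize` are hypothetical helpers and the prices are made up.

```python
import numpy as np

def fit_standardizer(values):
    """Return the mean and population standard deviation of the values."""
    values = np.asarray(values, dtype=float)
    return values.mean(), values.std()  # np.std defaults to ddof=0

def standardize(values, mean, std):
    """Rescale values to zero mean and unit variance using fitted statistics."""
    return (np.asarray(values, dtype=float) - mean) / std

prices = [1.0, 2.0, 3.0]  # illustrative values only
mean, std = fit_standardizer(prices)
z = standardize(prices, mean, std)
```

Keeping the fit and the transform separate matters once the data is split: the statistics should be estimated on the training portion and then reused unchanged for cross-validation and test data.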

### 1.5.4 Merging all the above features

- we need to merge all the feature vectors, i.e. the categorical, text, and numerical vectors

In [None]:
print(categories_one_hot.shape)
print(sub_categories_one_hot.shape)
print(essay_bow.shape)
print(price_standardized.shape)



In [None]:

# fill missing third and fourth essays with empty strings so that dropna keeps those rows
project_data['project_essay_3'] = project_data['project_essay_3'].fillna('')
project_data['project_essay_4'] = project_data['project_essay_4'].fillna('')
project_data.dropna(inplace=True)
# project_data.index[project_data['teacher_prefix'].isna() == True].tolist()

# project_data.isna().any()

In [None]:
# merge two sparse matrices: https://stackoverflow.com/a/19710648/4084039
from scipy.sparse import hstack
# the same hstack function can concatenate a sparse matrix and a dense matrix :)

X = pd.DataFrame({"clean_subcategories":project_data['clean_subcategories'][:50000],
                   "clean_categories":project_data['clean_categories'][:50000],
                   "Price":project_data['price'][:50000],
                   "teacher_prefix":project_data['teacher_prefix'][:50000], 
                    "school_state":project_data['school_state'][:50000],
                    "project_grade_category":project_data['project_grade_category'][:50000],
                    "quantity":project_data['quantity'][:50000], 
                    "teacher_number_of_previously_posted_projects":project_data['teacher_number_of_previously_posted_projects'][:50000]
})

Y = project_data['project_is_approved'][:50000]

# X = hstack((categories_one_hot, sub_categories_one_hot, price_standardized, teacher_prefix_one_hot,
#             school_state_one_hot, project_grade_category_one_hot, pre_quantity, pre_teacher_number_of_previously_posted_projects))
# X = pd.SparseDataFrame(X)
# Y = np.array(Y).reshape(-1,1)
# Y.shape

In [None]:
print(X.shape)
print(Y.shape)


# Assignment 3: Apply KNN

<ol>
    <li><strong>[Task-1] Apply KNN (brute force version) on these feature sets</strong>
        <ul>
            <li><font color='red'>Set 1</font>: categorical, numerical features + project_title (BOW) + preprocessed_essay (BOW)</li>
            <li><font color='red'>Set 2</font>: categorical, numerical features + project_title (TFIDF) + preprocessed_essay (TFIDF)</li>
            <li><font color='red'>Set 3</font>: categorical, numerical features + project_title (AVG W2V) + preprocessed_essay (AVG W2V)</li>
            <li><font color='red'>Set 4</font>: categorical, numerical features + project_title (TFIDF W2V) + preprocessed_essay (TFIDF W2V)</li>
        </ul>
    </li>
    <br>
    <li><strong>Hyperparameter tuning to find the best K</strong>
        <ul>
    <li>Find the best hyperparameter, i.e. the one that results in the maximum <a href='https://www.appliedaicourse.com/course/applied-ai-course-online/lessons/receiver-operating-characteristic-curve-roc-curve-and-auc-1/'>AUC</a> value</li>
    <li>Find the best hyperparameter using k-fold cross validation (or) simple cross validation data</li>
    <li>Use gridsearch-cv or randomsearch-cv, or write your own for loops to do this task</li>
        </ul>
    </li>
    <br>
    <li>
    <strong>Representation of results</strong>
        <ul>
    <li>You need to plot the performance of model both on train data and cross validation data for each hyper parameter, as shown in the figure
    <img src='train_cv_auc.JPG' width=300px></li>
    <li>Once you find the best hyper parameter, you need to train your model-M using the best hyper-param. Now, find the AUC on test data and plot the ROC curve on both train and test using model-M.
    <img src='train_test_auc.JPG' width=300px></li>
    <li>Along with plotting ROC curve, you need to print the <a href='https://www.appliedaicourse.com/course/applied-ai-course-online/lessons/confusion-matrix-tpr-fpr-fnr-tnr-1/'>confusion matrix</a> with predicted and original labels of test data points
    <img src='confusion_matrix.png' width=300px></li>
        </ul>
    </li>
    <li><strong> [Task-2] </strong>
        <ul>
            <li>Select top 2000 features from feature <font color='red'>Set 2</font> using <a href='https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html'>`SelectKBest`</a>
and then apply KNN on top of these features</li>
            <li>
                <pre>
                from sklearn.datasets import load_digits
                from sklearn.feature_selection import SelectKBest, chi2
                X, y = load_digits(return_X_y=True)
                X.shape
                X_new = SelectKBest(chi2, k=20).fit_transform(X, y)
                X_new.shape
                ========
                output:
                (1797, 64)
                (1797, 20)
                </pre>
            </li>
            <li>Repeat the steps 2 and 3 on the data matrix after feature selection</li>
        </ul>
    </li>
    <br>
    <li><strong>Conclusion</strong>
        <ul>
    <li>Summarize the results in a table at the end of the notebook. To print a table, refer to the prettytable library <a href='http://zetcode.com/python/prettytable/'>link</a>.
        <img src='summary.JPG' width=400px>
    </li>
        </ul>
</ol>
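
The tuning step described above can be sketched as follows. This is only an illustration under stated assumptions: the data comes from `make_classification` rather than the DonorsChoose features, and the grid of K values and the names `X_demo`, `y_demo` and `search` are placeholders chosen for this sketch.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic, imbalanced stand-in for the real feature matrix (assumption).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=20, n_informative=5,
    weights=[0.2, 0.8], random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

# Candidate values of K; this grid is illustrative, not prescribed by the assignment.
param_grid = {"n_neighbors": [1, 5, 11, 21, 41]}
search = GridSearchCV(
    KNeighborsClassifier(algorithm="brute"),
    param_grid,
    scoring="roc_auc",        # select K by AUC
    cv=3,
    return_train_score=True,  # needed to plot train AUC next to CV AUC
)
search.fit(X_tr, y_tr)

best_k = search.best_params_["n_neighbors"]
train_auc = search.cv_results_["mean_train_score"]  # one value per K
cv_auc = search.cv_results_["mean_test_score"]      # one value per K

# GridSearchCV refits the best model on all of X_tr, so it can score the test set.
test_auc = roc_auc_score(y_te, search.predict_proba(X_te)[:, 1])
print(best_k, test_auc)
```

Plotting `train_auc` and `cv_auc` against the grid of K values gives the train/CV curve the task asks for.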

<h4><font color='red'>Note: Data Leakage</font></h4>

1. Data leakage occurs if you vectorize the entire data set and then split it into train/cv/test.
2. To avoid data leakage, split your data first and then vectorize it.
3. While vectorizing, apply fit_transform() to your train data and transform() to the cv/test data.
4. For more details please go through this <a href='https://soundcloud.com/applied-ai-course/leakage-bow-and-tfidf'>link.</a>
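
A minimal sketch of the split-then-vectorize order from the note above. The sentences and labels are made up for illustration; `docs`, `labels` and `bow` are hypothetical names, not variables from this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Toy corpus and labels (assumption: illustrative data only).
docs = [
    "students need books for reading",
    "we need tablets for math class",
    "help fund art supplies",
    "new science kits for our lab",
    "books and more books for the library",
    "music instruments for band practice",
    "math games for first grade",
    "sports equipment for recess",
]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

# 1) Split first.
docs_train, docs_test, y_train, y_test = train_test_split(
    docs, labels, test_size=0.25, stratify=labels, random_state=0
)

# 2) Learn the vocabulary from the train split only.
bow = CountVectorizer()
X_train_bow = bow.fit_transform(docs_train)

# 3) Reuse the fitted vectorizer on the test split; unseen words are ignored.
X_test_bow = bow.transform(docs_test)

print(X_train_bow.shape, X_test_bow.shape)
```

Both matrices end up with the same number of columns because the test split is projected onto the vocabulary learned from the train split.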

<h1>2. K Nearest Neighbor</h1>

<h2>2.1 Splitting data into Train and cross validation(or test): Stratified Sampling</h2>

In [None]:
# please write all the code with proper documentation, and proper titles for each subsection
# go through documentations and blogs before you start coding
# first figure out what to do, and then think about how to do.
# reading and understanding error messages will be very helpful in debugging your code
# when you plot any graph make sure you use 
    # a. Title, that describes your plot, this will be very helpful to the reader
    # b. Legends if needed
    # c. X-axis label
    # d. Y-axis label

# Splitting the categorical/numerical features, stratified on the class label
from sklearn.model_selection import train_test_split
X_train1n, X_testn, Y_train1n, Y_testn = train_test_split(X, Y, test_size=0.20, stratify=Y, random_state=0)
X_trainn, X_cross_valn, Y_trainn, Y_cross_valn = train_test_split(X_train1n, Y_train1n, test_size=0.20, stratify=Y_train1n, random_state=0)
# X_trainn , Y_trainn, X_cross_valn, Y_cross_valn, X_testn, Y_testn

# Splitting the text data with the same labels, test_size and random_state,
# so its rows stay aligned with the split above
text = pd.DataFrame({'Title': preprocessed_title, "Essays":preprocessed_essays})
X_train1t, X_testt, Y_train1t, Y_testt = train_test_split(text, Y, test_size=0.20, stratify=Y, random_state=0)
X_traint, X_cross_valt, Y_traint, Y_cross_valt = train_test_split(X_train1t, Y_train1t, test_size=0.20, stratify=Y_train1t, random_state=0)
# X_traint, Y_traint, X_cross_valt, Y_cross_valt, X_testt, Y_testt
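
To illustrate what stratified sampling does, here is a small sketch on made-up labels with an 85/15 class ratio (an assumed ratio for the example, not a statistic of the data set). Passing `stratify` to `train_test_split` keeps the ratio roughly equal in every split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels: 850 positives, 150 negatives (assumption for illustration).
y_toy = np.array([1] * 850 + [0] * 150)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

# First split off the test set, then split the remainder into train and CV.
X_tmp, X_test, y_tmp, y_test = train_test_split(
    X_toy, y_toy, test_size=0.20, stratify=y_toy, random_state=0
)
X_train, X_cv, y_train, y_cv = train_test_split(
    X_tmp, y_tmp, test_size=0.20, stratify=y_tmp, random_state=0
)

# The share of positives is close to 0.85 in each split.
for name, y_part in [("train", y_train), ("cv", y_cv), ("test", y_test)]:
    print(name, len(y_part), round(y_part.mean(), 3))
```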



<h2>2.2 Make Data Model Ready: encoding numerical, categorical features</h2>
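
For numerical features, a leakage-free standardization fits the scaler on the train split and reuses it for the other splits. A minimal sketch, assuming toy price values (`price_train` and `price_test` are hypothetical arrays, not notebook variables):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy prices (assumption: illustrative values only).
price_train = np.array([[10.0], [20.0], [30.0], [40.0]])
price_test = np.array([[25.0], [100.0]])

scaler = StandardScaler()
# fit_transform learns the mean and standard deviation from the train split.
price_train_std = scaler.fit_transform(price_train)
# transform reuses the train statistics; the test split is never used for fitting.
price_test_std = scaler.transform(price_test)

print(scaler.mean_, price_test_std.ravel())
```

A test value equal to the train mean maps to 0, and a test value far outside the train range maps to a large score rather than being rescaled to look typical.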

In [None]:
# # please write all the code with proper documentation, and proper titles for each subsection
# # go through documentations and blogs before you start coding 
# # first figure out what to do, and then think about how to do.
# # reading and understanding error messages will be very helpful in debugging your code
# # make sure you featurize train and test data separately

# # when you plot any graph make sure you use 
#     # a. Title, that describes your plot, this will be very helpful to the reader
#     # b. Legends if needed
#     # c. X-axis label
#     # d. Y-axis label
    
# # Encoding numeric data points


# One-hot encoding of "teacher_prefix": the vocabulary is built from the train split only
from sklearn.feature_extraction.text import CountVectorizer
vectorizer_teacher_trained = CountVectorizer(vocabulary=list(set(X_trainn['teacher_prefix'])), lowercase=False, binary=True)
X_trainn_teacher_prefix_one_hot = vectorizer_teacher_trained.fit_transform(X_trainn['teacher_prefix'].values)
print(vectorizer_teacher_trained.get_feature_names())
print("Shape of train matrix after one-hot encoding ",X_trainn_teacher_prefix_one_hot.shape)

# cross-validation split: reuse the vectorizer fitted on train
X_cross_valn_teacher_prefix_one_hot = vectorizer_teacher_trained.transform(X_cross_valn['teacher_prefix'].values)
print("Shape of cross-validation matrix after one-hot encoding ",X_cross_valn_teacher_prefix_one_hot.shape)

# test split: reuse the vectorizer fitted on train
X_testn_teacher_prefix_one_hot = vectorizer_teacher_trained.transform(X_testn['teacher_prefix'].values)
print("Shape of test matrix after one-hot encoding ",X_testn_teacher_prefix_one_hot.shape)

# #---------------------------------------------------------------------------------------------------------------------
# One-hot encoding of "school_state": the vocabulary is built from the train split only

vectorizer_state_trained = CountVectorizer(vocabulary=list(set(X_trainn['school_state'])), lowercase=False, binary=True)
X_trainn_school_state_one_hot = vectorizer_state_trained.fit_transform(X_trainn['school_state'].values)
print(vectorizer_state_trained.get_feature_names())
print("Shape of train matrix after one-hot encoding ",X_trainn_school_state_one_hot.shape)

# cross-validation split: reuse the vectorizer fitted on train
X_cross_valn_school_state_one_hot = vectorizer_state_trained.transform(X_cross_valn['school_state'].values)
print("Shape of cross-validation matrix after one-hot encoding ",X_cross_valn_school_state_one_hot.shape)

# test split: reuse the vectorizer fitted on train
X_testn_school_state_one_hot = vectorizer_state_trained.transform(X_testn['school_state'].values)
print("Shape of test matrix after one-hot encoding ",X_testn_school_state_one_hot.shape)

# #---------------------------------------------------------------------------------------------------------------------
# One-hot encoding of "project_grade_category": the vocabulary is built from the train split only

vectorizer_grade_trained = CountVectorizer(vocabulary=list(set(X_trainn['project_grade_category'])), lowercase=False, binary=True)
X_trainn_project_grade_category_one_hot = vectorizer_grade_trained.fit_transform(X_trainn['project_grade_category'].values)
print(vectorizer_grade_trained.get_feature_names())
print("Shape of train matrix after one-hot encoding ",X_trainn_project_grade_category_one_hot.shape)

# cross-validation split: reuse the vectorizer fitted on train
X_cross_valn_project_grade_category_one_hot = vectorizer_grade_trained.transform(X_cross_valn['project_grade_category'].values)
print("Shape of cross-validation matrix after one-hot encoding ",X_cross_valn_project_grade_category_one_hot.shape)

# test split: reuse the vectorizer fitted on train
X_testn_project_grade_category_one_hot = vectorizer_grade_trained.transform(X_testn['project_grade_category'].values)
print("Shape of test matrix after one-hot encoding ",X_testn_project_grade_category_one_hot.shape)

#---------------------------------------------------------------------------------------------------------------------
# Preprocessing of "project_resource_summary"

# similarly you can preprocess the titles also
preprocessed_project_resource_summary = []
for sentence in tqdm(project_data['project_resource_summary'][:50000]):
    # each step works on the output of the previous one
    s = sentence.replace("\\r", "")
    s = s.replace("\\n", "")
    s = decontracted(s)
    s = re.sub('[^A-Za-z0-9]+', ' ', s)
    preprocessed_project_resource_summary.append(s)

# #---------------------------------------------------------------------------------------------------------------------
# Standardization of "teacher_number_of_previously_posted_projects"

# fit on the train split only (finds its mean and standard deviation), then reuse the same scaler for cv and test
teacher_scalar = StandardScaler()
teacher_scalar.fit(X_trainn['teacher_number_of_previously_posted_projects'].values.reshape(-1,1))

X_trainn_pre_teacher_number_of_previously_posted_projects = teacher_scalar.transform(X_trainn['teacher_number_of_previously_posted_projects'].values.reshape(-1, 1))
X_cross_valn_pre_teacher_number_of_previously_posted_projects = teacher_scalar.transform(X_cross_valn['teacher_number_of_previously_posted_projects'].values.reshape(-1, 1))
X_testn_pre_teacher_number_of_previously_posted_projects = teacher_scalar.transform(X_testn['teacher_number_of_previously_posted_projects'].values.reshape(-1, 1))
# #---------------------------------------------------------------------------------------------------------------------
# One-hot encoding of "clean_categories": the vocabulary is built from the train split only

vectorizer_clean_categories_trained = CountVectorizer(vocabulary=list(set(X_trainn['clean_categories'])), lowercase=False, binary=True)
X_trainn_clean_categories_one_hot = vectorizer_clean_categories_trained.fit_transform(X_trainn['clean_categories'].values)
print(vectorizer_clean_categories_trained.get_feature_names())
print("Shape of train matrix after one-hot encoding ",X_trainn_clean_categories_one_hot.shape)

# cross-validation split: reuse the vectorizer fitted on train
X_cross_valn_clean_categories_one_hot = vectorizer_clean_categories_trained.transform(X_cross_valn['clean_categories'].values)
print("Shape of cross-validation matrix after one-hot encoding ",X_cross_valn_clean_categories_one_hot.shape)

# test split: reuse the vectorizer fitted on train
X_testn_clean_categories_one_hot = vectorizer_clean_categories_trained.transform(X_testn['clean_categories'].values)
print("Shape of test matrix after one-hot encoding ",X_testn_clean_categories_one_hot.shape)

# #---------------------------------------------------------------------------------------------------------------------
# One-hot encoding of 'clean_subcategories': the vocabulary is built from the train split only

vectorizer_clean_subcategories_trained = CountVectorizer(vocabulary=list(set(X_trainn['clean_subcategories'])), lowercase=False, binary=True)
X_trainn_clean_subcategories_one_hot = vectorizer_clean_subcategories_trained.fit_transform(X_trainn['clean_subcategories'].values)
print(vectorizer_clean_subcategories_trained.get_feature_names())
print("Shape of train matrix after one-hot encoding ",X_trainn_clean_subcategories_one_hot.shape)

# cross-validation split: reuse the vectorizer fitted on train
X_cross_valn_clean_subcategories_one_hot = vectorizer_clean_subcategories_trained.transform(X_cross_valn['clean_subcategories'].values)
print("Shape of cross-validation matrix after one-hot encoding ",X_cross_valn_clean_subcategories_one_hot.shape)

# test split: reuse the vectorizer fitted on train
X_testn_clean_subcategories_one_hot = vectorizer_clean_subcategories_trained.transform(X_testn['clean_subcategories'].values)
print("Shape of test matrix after one-hot encoding ",X_testn_clean_subcategories_one_hot.shape)
# #---------------------------------------------------------------------------------------------------------------------
# Standardization of "price"
# pre_price = preprocessing.scale(project_data['price'][5000:6000])

# fit on the train split only (finds its mean and standard deviation), then reuse the same scaler for cv and test
price_scalar = StandardScaler()
price_scalar.fit(X_trainn['Price'].values.reshape(-1,1))

X_trainn_pre_price = price_scalar.transform(X_trainn['Price'].values.reshape(-1, 1))
X_cross_valn_pre_price = price_scalar.transform(X_cross_valn['Price'].values.reshape(-1, 1))
X_testn_pre_price = price_scalar.transform(X_testn['Price'].values.reshape(-1, 1))

# #---------------------------------------------------------------------------------------------------------------------
# Standardization of "quantity"


# X_trainn , X_cross_valn, X_testn
# done'clean_subcategories', done'clean_categories', done'Price',            done'teacher_prefix',
#        done'school_state', done'project_grade_category', done'quantity',
#        done'teacher_number_of_previously_posted_projects'


# fit on the train split only (finds its mean and standard deviation), then reuse the same scaler for cv and test
quantity_scalar = StandardScaler()
quantity_scalar.fit(X_trainn['quantity'].values.reshape(-1,1))

X_trainn_pre_quantity = quantity_scalar.transform(X_trainn['quantity'].values.reshape(-1, 1))
X_cross_valn_pre_quantity = quantity_scalar.transform(X_cross_valn['quantity'].values.reshape(-1, 1))
X_testn_pre_quantity = quantity_scalar.transform(X_testn['quantity'].values.reshape(-1, 1))
#--------------------------------------------------------------------------------------------------------------------
# X_trainn_teacher_prefix_one_hot
# X_cross_valn_teacher_prefix_one_hot
# X_testn_teacher_prefix_one_hot
# X_trainn_school_state_one_hot
# X_cross_valn_school_state_one_hot
# X_testn_school_state_one_hot
# X_trainn_project_grade_category_one_hot
# X_cross_valn_project_grade_category_one_hot
# X_testn_project_grade_category_one_hot
# X_trainn_pre_teacher_number_of_previously_posted_projects
# X_cross_valn_pre_teacher_number_of_previously_posted_projects
# X_testn_pre_teacher_number_of_previously_posted_projects
# X_trainn_clean_categories_one_hot
# X_cross_valn_clean_categories_one_hot
# X_testn_clean_categories_one_hot
# X_trainn_clean_subcategories_one_hot
# X_cross_valn_clean_subcategories_one_hot
# X_testn_clean_subcategories_one_hot
# X_trainn_pre_price
# X_cross_valn_pre_price
# X_testn_pre_price
# X_trainn_pre_quantity
# X_cross_valn_pre_quantity
# X_testn_pre_quantity
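
The encoded pieces listed above are eventually stacked column-wise into one feature matrix per split. A minimal sketch of that step, assuming toy inputs (`one_hot` and `price_std` are hypothetical stand-ins for the notebook's matrices). The dense column is wrapped in a sparse matrix first so that every block passed to `hstack` has the same kind.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

# Toy one-hot block: 4 rows, 3 categories (assumption: illustrative data).
one_hot = csr_matrix(np.array([
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [1, 0, 0],
]))

# Toy standardized numerical column with the same number of rows.
price_std = np.array([[-1.2], [0.3], [0.9], [0.0]])

# Stack column-wise; all blocks must have the same number of rows.
X_combined = hstack((one_hot, csr_matrix(price_std))).tocsr()
print(X_combined.shape)
```

The same call, given the train, cross-validation or test blocks in a fixed column order, produces matrices whose columns line up across the three splits.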

<h2>2.3 Make Data Model Ready: encoding essay, and project_title</h2>
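
The average word-vector featurization used in this section repeats the same loop for every split and every text column. The sketch below wraps that loop in a helper. It is only an illustration: `average_word_vectors` and the two-dimensional `toy_embeddings` are hypothetical, whereas the notebook uses 300-dimensional GloVe vectors loaded from a pickle file.

```python
import numpy as np

def average_word_vectors(sentences, embeddings, dim):
    """Return one row per sentence: the mean of the vectors of its known words."""
    rows = []
    for sentence in sentences:
        total = np.zeros(dim)   # accumulator, initialised to zeros
        count = 0               # number of words that have a vector
        for word in sentence.split():
            if word in embeddings:
                total += embeddings[word]
                count += 1
        if count:
            total /= count      # sentences with no known word stay all-zero
        rows.append(total)
    return np.vstack(rows)

# Toy embeddings (assumption: 2 dimensions for readability).
toy_embeddings = {
    "math": np.array([1.0, 0.0]),
    "books": np.array([0.0, 1.0]),
}
toy_sentences = ["math books", "unknown words only", "math math books"]

toy_avg = average_word_vectors(toy_sentences, toy_embeddings, dim=2)
print(toy_avg)
```

Because the embeddings are pretrained and nothing is fitted on the data, the same helper can be applied to the train, cross-validation and test splits without leakage.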

In [None]:
# please write all the code with proper documentation, and proper titles for each subsection
# go through documentations and blogs before you start coding
# first figure out what to do, and then think about how to do.
# reading and understanding error messages will be very helpful in debugging your code
# make sure you featurize train and test data separately

# when you plot any graph make sure you use 
    # a. Title, that describes your plot, this will be very helpful to the reader
    # b. Legends if needed
    # c. X-axis label
    # d. Y-axis label

###### TRAIN
#--------------------------------------------------------------------------------------------------------------------
# BOW of "preprocessed_title"(Bag of Words)

vectorizer_bow_title_trained = CountVectorizer(min_df=10, max_features = 5000)
title_bow_train = vectorizer_bow_title_trained.fit_transform(X_traint['Title'])
print("Shape of train matrix after BOW encoding ",title_bow_train.shape)

#--------------------------------------------------------------------------------------------------------------------
# BOW of "preprocessed_essays"(Bag of Words)

vectorizer_bow_essay_trained = CountVectorizer(min_df=10, max_features = 5000)
essay_bow_train = vectorizer_bow_essay_trained.fit_transform(X_traint['Essays'])
print("Shape of train matrix after BOW encoding ",essay_bow_train.shape)

#--------------------------------------------------------------------------------------------------------------------
# AVG W2V of "preprocessed_essays"
with open('../input/glove-vectors/glove_vectors', 'rb') as f:
    model = pickle.load(f)
    glove_words =  set(model.keys())
    
avg_w2v_preprocessed_essays_train = []; # the avg-w2v for each sentence/review is stored in this list
for sentence in tqdm(X_traint['Essays']): # for each review/sentence
    vector = np.zeros(300) # 300-dimensional accumulator, initialised to zeros
    cnt_words =0; # num of words with a valid vector in the sentence/review
    for word in sentence.split(): # for each word in a review/sentence
        if word in glove_words:
            vector += model[word]
            cnt_words += 1
    if cnt_words != 0:
        vector /= cnt_words
    avg_w2v_preprocessed_essays_train.append(vector)

print(len(avg_w2v_preprocessed_essays_train))

#--------------------------------------------------------------------------------------------------------------------
# AVG W2V of "preprocessed_title"
with open('../input/glove-vectors/glove_vectors', 'rb') as f:
    model = pickle.load(f)
    glove_words =  set(model.keys())
    
avg_w2v_preprocessed_title_train = []; # the avg-w2v for each sentence/review is stored in this list
for sentence in tqdm(X_traint['Title']): # for each review/sentence
    vector = np.zeros(300) # 300-dimensional accumulator, initialised to zeros
    cnt_words =0; # num of words with a valid vector in the sentence/review
    for word in sentence.split(): # for each word in a review/sentence
        if word in glove_words:
            vector += model[word]
            cnt_words += 1
    if cnt_words != 0:
        vector /= cnt_words
    avg_w2v_preprocessed_title_train.append(vector)

print(len(avg_w2v_preprocessed_title_train))

#---------------------------------------------------------------------------------------------------------------------
# "preprocessed_essays" TFIDF

from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer_tfidf_essays_trained = TfidfVectorizer(min_df=10, max_features= 5000)
essays_tfidf_train = vectorizer_tfidf_essays_trained.fit_transform(X_traint['Essays'])
print("Shape of train matrix after TFIDF encoding ",essays_tfidf_train.shape)

#---------------------------------------------------------------------------------------------------------------------
# "preprocessed_title" TFIDF

vectorizer_tfidf_title_trained = TfidfVectorizer(min_df=10, max_features= 5000)
title_tfidf_train = vectorizer_tfidf_title_trained.fit_transform(X_traint['Title'])
print("Shape of train matrix after TFIDF encoding ",title_tfidf_train.shape)
#--------------------------------------------------------------------------------------------------------------------
# TFIDF-weighted W2V of "Essays"

tfidf_model = TfidfVectorizer(max_features= 5000)
tfidf_model.fit(X_traint['Essays'])
# build a dictionary with each word as the key and its idf as the value
dictionary = dict(zip(tfidf_model.get_feature_names(), list(tfidf_model.idf_)))
tfidf_words = set(tfidf_model.get_feature_names())

tfidf_w2v_Essay_train = []; # the tfidf-weighted w2v for each essay is stored in this list
for sentence in tqdm(X_traint['Essays']): # for each essay
    vector = np.zeros(300) # 300-dimensional accumulator, initialised to zeros
    tf_idf_weight =0; # running sum of the tfidf weights used for this essay
    words = sentence.split()
    for word in words: # for each word in the essay
        if (word in glove_words) and (word in tfidf_words):
            vec = model[word] # getting the vector for each word
            # tfidf = idf value (dictionary[word]) * tf value (token count of the word / number of tokens)
            tf_idf = dictionary[word]*(words.count(word)/len(words))
            vector += (vec * tf_idf) # calculating tfidf weighted w2v
            tf_idf_weight += tf_idf
    if tf_idf_weight != 0:
        vector /= tf_idf_weight
    tfidf_w2v_Essay_train.append(vector)


#--------------------------------------------------------------------------------------------------------------------

# TFIDF-weighted W2V of "Title"

tfidf_model = TfidfVectorizer(max_features= 5000)
tfidf_model.fit(X_traint['Title'])
# build a dictionary with each word as the key and its idf as the value
dictionary = dict(zip(tfidf_model.get_feature_names(), list(tfidf_model.idf_)))
tfidf_words = set(tfidf_model.get_feature_names())

tfidf_w2v_Titles_train = []; # the tfidf-weighted w2v for each title is stored in this list
for sentence in tqdm(X_traint['Title']): # for each title
    vector = np.zeros(300) # 300-dimensional accumulator, initialised to zeros
    tf_idf_weight =0; # running sum of the tfidf weights used for this title
    words = sentence.split()
    for word in words: # for each word in the title
        if (word in glove_words) and (word in tfidf_words):
            vec = model[word] # getting the vector for each word
            # tfidf = idf value (dictionary[word]) * tf value (token count of the word / number of tokens)
            tf_idf = dictionary[word]*(words.count(word)/len(words))
            vector += (vec * tf_idf) # calculating tfidf weighted w2v
            tf_idf_weight += tf_idf
    if tf_idf_weight != 0:
        vector /= tf_idf_weight
    tfidf_w2v_Titles_train.append(vector)



###### TEST
#--------------------------------------------------------------------------------------------------------------------
# BOW of "preprocessed_title"(Bag of Words)

# reuse the vectorizer fitted on the train split
title_bow_test = vectorizer_bow_title_trained.transform(X_testt['Title'])
print("Shape of test matrix after BOW encoding ",title_bow_test.shape)

#--------------------------------------------------------------------------------------------------------------------
# BOW of "preprocessed_essays"(Bag of Words)

# reuse the vectorizer fitted on the train split
essay_bow_test = vectorizer_bow_essay_trained.transform(X_testt['Essays'])
print("Shape of test matrix after BOW encoding ",essay_bow_test.shape)

#--------------------------------------------------------------------------------------------------------------------
# AVG W2V of "preprocessed_essays"
with open('../input/glove-vectors/glove_vectors', 'rb') as f:
    model = pickle.load(f)
    glove_words =  set(model.keys())
    
avg_w2v_preprocessed_essays_test = []; # the avg-w2v for each sentence/review is stored in this list
for sentence in tqdm(X_testt['Essays']): # for each review/sentence
    vector = np.zeros(300) # 300-dimensional accumulator, initialised to zeros
    cnt_words =0; # num of words with a valid vector in the sentence/review
    for word in sentence.split(): # for each word in a review/sentence
        if word in glove_words:
            vector += model[word]
            cnt_words += 1
    if cnt_words != 0:
        vector /= cnt_words
    avg_w2v_preprocessed_essays_test.append(vector)

print(len(avg_w2v_preprocessed_essays_test))

#--------------------------------------------------------------------------------------------------------------------
# AVG W2V of "preprocessed_title"
with open('../input/glove-vectors/glove_vectors', 'rb') as f:
    model = pickle.load(f)
    glove_words =  set(model.keys())
    
avg_w2v_preprocessed_title_test = []; # the avg-w2v for each sentence/review is stored in this list
for sentence in tqdm(X_testt['Title']): # for each review/sentence
    vector = np.zeros(300) # 300-dimensional accumulator, initialised to zeros
    cnt_words =0; # num of words with a valid vector in the sentence/review
    for word in sentence.split(): # for each word in a review/sentence
        if word in glove_words:
            vector += model[word]
            cnt_words += 1
    if cnt_words != 0:
        vector /= cnt_words
    avg_w2v_preprocessed_title_test.append(vector)

print(len(avg_w2v_preprocessed_title_test))

#---------------------------------------------------------------------------------------------------------------------
# "preprocessed_essays" TFIDF

# reuse the vectorizer fitted on the train split
essays_tfidf_test = vectorizer_tfidf_essays_trained.transform(X_testt['Essays'])
print("Shape of test matrix after TFIDF encoding ",essays_tfidf_test.shape)

#---------------------------------------------------------------------------------------------------------------------
# "preprocessed_title" TFIDF

# reuse the vectorizer fitted on the train split
title_tfidf_test = vectorizer_tfidf_title_trained.transform(X_testt['Title'])
print("Shape of test matrix after TFIDF encoding ",title_tfidf_test.shape)
#--------------------------------------------------------------------------------------------------------------------
# TFIDF-weighted W2V of "Essays"

from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_model = TfidfVectorizer(max_features= 5000)
# idf values are learned from the train split only, to avoid data leakage
tfidf_model.fit(X_traint['Essays'])
# build a dictionary with each word as the key and its idf as the value
dictionary = dict(zip(tfidf_model.get_feature_names(), list(tfidf_model.idf_)))
tfidf_words = set(tfidf_model.get_feature_names())

tfidf_w2v_Essay_test = []; # the tfidf-weighted w2v for each essay is stored in this list
for sentence in tqdm(X_testt['Essays']): # for each essay
    vector = np.zeros(300) # 300-dimensional accumulator, initialised to zeros
    tf_idf_weight =0; # running sum of the tfidf weights used for this essay
    words = sentence.split()
    for word in words: # for each word in the essay
        if (word in glove_words) and (word in tfidf_words):
            vec = model[word] # getting the vector for each word
            # tfidf = idf value (dictionary[word]) * tf value (token count of the word / number of tokens)
            tf_idf = dictionary[word]*(words.count(word)/len(words))
            vector += (vec * tf_idf) # calculating tfidf weighted w2v
            tf_idf_weight += tf_idf
    if tf_idf_weight != 0:
        vector /= tf_idf_weight
    tfidf_w2v_Essay_test.append(vector)


#--------------------------------------------------------------------------------------------------------------------

# TFIDF-weighted W2V of "Title"

from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_model = TfidfVectorizer(max_features= 5000)
# idf values are learned from the train split only, to avoid data leakage
tfidf_model.fit(X_traint['Title'])
# build a dictionary with each word as the key and its idf as the value
dictionary = dict(zip(tfidf_model.get_feature_names(), list(tfidf_model.idf_)))
tfidf_words = set(tfidf_model.get_feature_names())

tfidf_w2v_Titles_test = []; # the tfidf-weighted w2v for each title is stored in this list
for sentence in tqdm(X_testt['Title']): # for each title
    vector = np.zeros(300) # 300-dimensional accumulator, initialised to zeros
    tf_idf_weight =0; # running sum of the tfidf weights used for this title
    words = sentence.split()
    for word in words: # for each word in the title
        if (word in glove_words) and (word in tfidf_words):
            vec = model[word] # getting the vector for each word
            # tfidf = idf value (dictionary[word]) * tf value (token count of the word / number of tokens)
            tf_idf = dictionary[word]*(words.count(word)/len(words))
            vector += (vec * tf_idf) # calculating tfidf weighted w2v
            tf_idf_weight += tf_idf
    if tf_idf_weight != 0:
        vector /= tf_idf_weight
    tfidf_w2v_Titles_test.append(vector)


###### CROSS_VAL
#--------------------------------------------------------------------------------------------------------------------
# BOW of "preprocessed_title"(Bag of Words)

# vectorizer = CountVectorizer(min_df=10, max_features = 5000)
title_bow_cross_val = vectorizer_bow_title_trained.transform(X_cross_valt['Title'])
print("Shape of matrix after one hot encodig ",title_bow_cross_val.shape)

#--------------------------------------------------------------------------------------------------------------------
# BOW of "preprocessed_essays"(Bag of Words)

# vectorizer = CountVectorizer(min_df=10, max_features = 5000)
essay_bow_cross_val = vectorizer_bow_essay_trained.transform(X_cross_valt['Essays'])
print("Shape of essay BOW matrix ",essay_bow_cross_val.shape)

#--------------------------------------------------------------------------------------------------------------------
# AVG W2V of "preprocessed_essays"
with open('../input/glove-vectors/glove_vectors', 'rb') as f:
    model = pickle.load(f)
    glove_words =  set(model.keys())
    
avg_w2v_preprocessed_essays_cross_val = []; # the avg-w2v for each essay is stored in this list
for sentence in tqdm(X_cross_valt['Essays']): # for each essay
    vector = np.zeros(300) # 300-dimensional zero vector, the same length as the word vectors
    cnt_words =0; # num of words with a valid vector in the essay
    for word in sentence.split(): # for each word in a review/sentence
        if word in glove_words:
            vector += model[word]
            cnt_words += 1
    if cnt_words != 0:
        vector /= cnt_words
    avg_w2v_preprocessed_essays_cross_val.append(vector)

print(len(avg_w2v_preprocessed_essays_cross_val))

#--------------------------------------------------------------------------------------------------------------------
# AVG W2V of "preprocessed_title"
# reuses `model` and `glove_words` loaded for the essays above

avg_w2v_preprocessed_title_cross_val = []; # the avg-w2v for each title is stored in this list
for sentence in tqdm(X_cross_valt['Title']): # for each title
    vector = np.zeros(300) # 300-dimensional zero vector, the same length as the word vectors
    cnt_words =0; # num of words with a valid vector in the title
    for word in sentence.split(): # for each word in a review/sentence
        if word in glove_words:
            vector += model[word]
            cnt_words += 1
    if cnt_words != 0:
        vector /= cnt_words
    avg_w2v_preprocessed_title_cross_val.append(vector)

print(len(avg_w2v_preprocessed_title_cross_val))

#---------------------------------------------------------------------------------------------------------------------
# "preprocessed_essays" TFIDF

from sklearn.feature_extraction.text import TfidfVectorizer
# vectorizer = TfidfVectorizer(min_df=10, max_features= 5000)
essays_tfidf_cross_val = vectorizer_tfidf_essays_trained.transform(X_cross_valt['Essays'])
print("Shape of essay TFIDF matrix ",essays_tfidf_cross_val.shape)

#---------------------------------------------------------------------------------------------------------------------
# "preprocessed_title" TFIDF

from sklearn.feature_extraction.text import TfidfVectorizer
# vectorizer = TfidfVectorizer(min_df=10 , max_features= 5000)
title_tfidf_cross_val = vectorizer_tfidf_title_trained.transform(X_cross_valt['Title'])
print("Shape of title TFIDF matrix ",title_tfidf_cross_val.shape)
#--------------------------------------------------------------------------------------------------------------------
# TFIDF-weighted W2V of "Essays"

tfidf_model = TfidfVectorizer(max_features= 5000)
# fit the idf weights on the training essays only, so that no cross-validation data leaks into the features
tfidf_model.fit(X_traint['Essays'])
# build a dictionary with the word as key and its idf as value
dictionary = dict(zip(tfidf_model.get_feature_names(), list(tfidf_model.idf_)))
tfidf_words = set(tfidf_model.get_feature_names())

tfidf_w2v_Essay_cross_val = []; # the tfidf-weighted w2v for each essay is stored in this list
for sentence in tqdm(X_cross_valt['Essays']): # for each essay
    vector = np.zeros(300) # 300-dimensional zero vector, the same length as the word vectors
    tf_idf_weight =0; # sum of the tfidf weights of the words with a valid vector
    for word in sentence.split(): # for each word in the essay
        if (word in glove_words) and (word in tfidf_words):
            vec = model[word] # getting the vector for each word
            # here we are multiplying idf value(dictionary[word]) and the tf value((sentence.count(word)/len(sentence.split())))
            tf_idf = dictionary[word]*(sentence.count(word)/len(sentence.split())) # getting the tfidf value for each word
            vector += (vec * tf_idf) # calculating tfidf weighted w2v
            tf_idf_weight += tf_idf
    if tf_idf_weight != 0:
        vector /= tf_idf_weight
    tfidf_w2v_Essay_cross_val.append(vector)


#--------------------------------------------------------------------------------------------------------------------

# TFIDF-weighted W2V of "Title"

tfidf_model = TfidfVectorizer(max_features= 5000)
# fit the idf weights on the training titles only, so that no cross-validation data leaks into the features
tfidf_model.fit(X_traint['Title'])
# build a dictionary with the word as key and its idf as value
dictionary = dict(zip(tfidf_model.get_feature_names(), list(tfidf_model.idf_)))
tfidf_words = set(tfidf_model.get_feature_names())

tfidf_w2v_Titles_cross_val = []; # the tfidf-weighted w2v for each title is stored in this list
for sentence in tqdm(X_cross_valt['Title']): # for each title
    vector = np.zeros(300) # 300-dimensional zero vector, the same length as the word vectors
    tf_idf_weight =0; # sum of the tfidf weights of the words with a valid vector
    for word in sentence.split(): # for each word in the title
        if (word in glove_words) and (word in tfidf_words):
            vec = model[word] # getting the vector for each word
            # here we are multiplying idf value(dictionary[word]) and the tf value((sentence.count(word)/len(sentence.split())))
            tf_idf = dictionary[word]*(sentence.count(word)/len(sentence.split())) # getting the tfidf value for each word
            vector += (vec * tf_idf) # calculating tfidf weighted w2v
            tf_idf_weight += tf_idf
    if tf_idf_weight != 0:
        vector /= tf_idf_weight
    tfidf_w2v_Titles_cross_val.append(vector)
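
The TFIDF-weighted W2V loops above repeat the same logic for every split. A minimal sketch of one reusable helper follows, assuming the idf weights are fitted once on the training text and then reused for the other splits. The names `tfidf_weighted_w2v`, `toy_embeddings`, `train_titles`, and `test_titles` are hypothetical, and the two-dimensional toy vectors stand in for the 300-dimensional GloVe vectors. Unlike the loops above, each distinct word is added once per sentence.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer


def tfidf_weighted_w2v(sentences, embeddings, idf, dim):
    """Return one TF-IDF weighted average word vector per sentence."""
    rows = []
    for sentence in sentences:
        words = sentence.split()
        vector = np.zeros(dim)
        weight_sum = 0.0
        for word in set(words):  # each distinct word contributes once
            if word in embeddings and word in idf:
                # tf is the share of the sentence's tokens equal to this word
                tf_idf = idf[word] * words.count(word) / len(words)
                vector += embeddings[word] * tf_idf
                weight_sum += tf_idf
        if weight_sum != 0:
            vector /= weight_sum
        rows.append(vector)
    return np.vstack(rows)


# Toy stand-ins (assumptions, not the notebook's data)
toy_embeddings = {"math": np.array([1.0, 0.0]), "books": np.array([0.0, 1.0])}
train_titles = ["math books", "math games", "books books"]
test_titles = ["math math books", "unknown words"]

# Fit idf on the training text only, then reuse it for the test text
tfidf = TfidfVectorizer()
tfidf.fit(train_titles)
idf = {word: tfidf.idf_[index] for word, index in tfidf.vocabulary_.items()}

test_vectors = tfidf_weighted_w2v(test_titles, toy_embeddings, idf, dim=2)
print(test_vectors)
```

A sentence with no known words keeps the all-zero vector, which matches the behavior of the loops above.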


In [None]:
# vectorizer_bow_title_trained = CountVectorizer(min_df=10, max_features = 5000)
# title_bow_train = vectorizer_bow_trained.fit_transform(X_traint['Title'])
# print("Shape of matrix after one hot encodig ",title_bow_train.shape)

# title_bow_test = vectorizer_bow_title_trained.transform(X_testt['Title'])
# print("Shape of matrix after one hot encodig ",title_bow_test.shape)
# # title_bow_cross_val = vectorizer_bow_title_trained.transform(X_cross_valt['Title'])
# # print("Shape of matrix after one hot encodig ",title_bow_cross_val.shape)

In [None]:
X_testn_pre_teacher_number_of_previously_posted_projects.shape

In [None]:
X_preprocessed_num_test = hstack(( X_testn_teacher_prefix_one_hot,
                                X_testn_school_state_one_hot,
                                X_testn_project_grade_category_one_hot,
                                X_testn_pre_teacher_number_of_previously_posted_projects,
                                X_testn_clean_categories_one_hot,
                                X_testn_clean_subcategories_one_hot,
                                X_testn_pre_price,
                                X_testn_pre_quantity))

X_preprocessed_num_train = hstack((X_trainn_teacher_prefix_one_hot,
                                    X_trainn_school_state_one_hot,
                                    X_trainn_project_grade_category_one_hot,
                                    X_trainn_pre_teacher_number_of_previously_posted_projects,
                                    X_trainn_clean_categories_one_hot,
                                    X_trainn_clean_subcategories_one_hot,
                                    X_trainn_pre_price,
                                    X_trainn_pre_quantity))

X_preprocessed_num_crossval = hstack((X_cross_valn_teacher_prefix_one_hot,
                                    X_cross_valn_school_state_one_hot,
                                    X_cross_valn_project_grade_category_one_hot,
                                    X_cross_valn_pre_teacher_number_of_previously_posted_projects,
                                    X_cross_valn_clean_categories_one_hot,
                                    X_cross_valn_clean_subcategories_one_hot,
                                    X_cross_valn_pre_price,
                                    X_cross_valn_pre_quantity))


In [None]:
# X_trainn , Y_trainn, X_cross_valn, Y_cross_valn, X_testn, Y_testn
# X_traint, Y_traint, X_cross_valt, Y_cross_valt, X_testt, Y_testt


<h2>2.4 Applying KNN on different kinds of featurization as mentioned in the instructions</h2>
<br>Apply KNN to each featurization listed in the instructions.
<br>For every model, make sure to complete step 2 and step 3 of the instructions.


In [None]:
# https://stackoverflow.com/questions/17818783/populate-a-pandas-sparsedataframe-from-a-scipy-sparse-matrix

# please write all the code with proper documentation, and proper titles for each subsection
# go through documentations and blogs before you start coding
# first figure out what to do, and then think about how to do.
# reading and understanding error messages will be very helpful in debugging your code

# when you plot any graph make sure you use 
    # a. Title, that describes your plot, this will be very helpful to the reader
    # b. Legends if needed
    # c. X-axis label
    # d. Y-axis label

from scipy.sparse import hstack
from scipy.sparse import csr_matrix

BOW_data_X_train = hstack((X_preprocessed_num_train, title_bow_train, essay_bow_train))

BOW_data_Y_train = np.array(Y_traint).reshape(-1,1)
# np.array_equal(Y_traint, Y_trainn) ==> True

# __________________________________________________________________________________________________________________
TFIDF_data_X_train = hstack((X_preprocessed_num_train, essays_tfidf_train, title_tfidf_train))

TFIDF_data_Y_train = np.array(Y_traint).reshape(-1,1)

# __________________________________________________________________________________________________________________

AVGW2V_data_X_train = hstack((X_preprocessed_num_train, avg_w2v_preprocessed_essays_train, avg_w2v_preprocessed_title_train))

AVGW2V_data_Y_train = np.array(Y_traint).reshape(-1,1)
# np.array_equal(Y_traint, Y_trainn) ==> True

# # __________________________________________________________________________________________________________________

TFIDF_w2v_data_X_train = hstack((X_preprocessed_num_train, tfidf_w2v_Essay_train, tfidf_w2v_Titles_train))

TFIDF_w2v_data_Y_train = np.array(Y_traint).reshape(-1,1)
# np.array_equal(Y_traint, Y_trainn) ==> True

######################################################


BOW_data_X_test = hstack((X_preprocessed_num_test, title_bow_test, essay_bow_test))

BOW_data_Y_test = np.array(Y_testn).reshape(-1,1)

# __________________________________________________________________________________________________________________
TFIDF_data_X_test = hstack((X_preprocessed_num_test, essays_tfidf_test, title_tfidf_test))

TFIDF_data_Y_test = np.array(Y_testn).reshape(-1,1)

# __________________________________________________________________________________________________________________

AVGW2V_data_X_test = hstack((X_preprocessed_num_test, avg_w2v_preprocessed_essays_test, avg_w2v_preprocessed_title_test))

AVGW2V_data_Y_test = np.array(Y_testn).reshape(-1,1)

# # __________________________________________________________________________________________________________________

TFIDF_w2v_data_X_test = hstack((X_preprocessed_num_test, tfidf_w2v_Essay_test, tfidf_w2v_Titles_test))

TFIDF_w2v_data_Y_test = np.array(Y_testn).reshape(-1,1)



######################################################


BOW_data_X_cross_val = hstack((X_preprocessed_num_crossval, title_bow_cross_val, essay_bow_cross_val))

BOW_data_Y_cross_val = np.array(Y_cross_valn).reshape(-1,1)

# __________________________________________________________________________________________________________________
TFIDF_data_X_cross_val = hstack((X_preprocessed_num_crossval, essays_tfidf_cross_val, title_tfidf_cross_val))

TFIDF_data_Y_cross_val = np.array(Y_cross_valn).reshape(-1,1)

# __________________________________________________________________________________________________________________

AVGW2V_data_Xcross_val = hstack((X_preprocessed_num_crossval, avg_w2v_preprocessed_essays_cross_val, avg_w2v_preprocessed_title_cross_val))

AVGW2V_data_Y_cross_val = np.array(Y_cross_valn).reshape(-1,1)

# # __________________________________________________________________________________________________________________

TFIDF_w2v_data_X_cross_val = hstack((X_preprocessed_num_crossval, tfidf_w2v_Essay_cross_val, tfidf_w2v_Titles_cross_val))

TFIDF_w2v_data_Y_cross_val = np.array(Y_cross_valn).reshape(-1,1)
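
The cell above stacks sparse one-hot and text matrices together with lists of dense W2V vectors. As a hedged sketch of that step on toy data, the block below stacks the dense vectors into a 2-D array and wraps them in a sparse matrix before calling `hstack`. The arrays `one_hot`, `bow`, and `avg_w2v` are made up for illustration, and keeping the labels 1-D is a suggestion rather than something the notebook does.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

# Toy stand-ins for the notebook's feature blocks (assumptions)
one_hot = csr_matrix(np.array([[1, 0, 0], [0, 1, 0]]))   # categorical features
bow = csr_matrix(np.array([[2, 0], [0, 1]]))              # text counts
avg_w2v = [np.array([0.1, 0.2]), np.array([0.3, 0.4])]    # list of per-row vectors

# Stack the list of vectors into a (n_rows, dim) array before combining
dense_block = np.vstack(avg_w2v)
features = hstack((one_hot, bow, csr_matrix(dense_block))).tocsr()

# scikit-learn estimators expect 1-D labels; a column vector triggers a warning
labels = np.array([1, 0])

print(features.shape, labels.shape)
```

Every block passed to `hstack` must have the same number of rows, so a shape check after stacking is a cheap way to catch a split that was vectorized with the wrong data.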


### 2.4.1 Applying KNN brute force on BOW,<font color='red'> SET 1</font>

In [None]:
print(BOW_data_X_train.shape)
print(BOW_data_X_cross_val.shape)

In [None]:
# https://stackoverflow.com/questions/9031783/hide-all-warnings-in-ipython
    
import warnings
warnings.filterwarnings('ignore')


from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score

train_auc = []
cv_auc = []
K = [1, 5, 10, 21, 31, 41, 51, 101]
for i in tqdm(K):
    neigh = KNeighborsClassifier(n_neighbors=i)
    neigh.fit(BOW_data_X_train, BOW_data_Y_train)
    
    y_train_pred = neigh.predict_proba(BOW_data_X_train)[:,1]
    y_cv_pred =  neigh.predict_proba(BOW_data_X_cross_val)[:,1]
    train_auc.append(roc_auc_score(BOW_data_Y_train,y_train_pred))
    cv_auc.append(roc_auc_score(BOW_data_Y_cross_val, y_cv_pred))

plt.plot(K, train_auc, label='Train AUC')
plt.plot(K, cv_auc, label='CV AUC')

plt.scatter(K, train_auc, label='Train AUC points')
plt.scatter(K, cv_auc, label='CV AUC points')

plt.legend()
plt.xlabel("K: hyperparameter")
plt.ylabel("AUC")
plt.title("AUC vs K (KNN on BOW)")
plt.grid()
plt.show()
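
The plot above is read by eye to choose K. A minimal sketch of making that choice in code is given below, assuming the K with the highest cross-validation AUC is wanted. The helper `pick_best_k` and the synthetic data from `make_classification` are hypothetical and are not part of the notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier


def pick_best_k(X_train, y_train, X_cv, y_cv, k_values):
    """Return the K with the highest CV AUC, plus all CV scores."""
    cv_scores = []
    for k in k_values:
        model = KNeighborsClassifier(n_neighbors=k, algorithm="brute")
        model.fit(X_train, y_train)
        # score with the probability of the positive class, not hard labels
        cv_scores.append(roc_auc_score(y_cv, model.predict_proba(X_cv)[:, 1]))
    best = k_values[int(np.argmax(cv_scores))]
    return best, cv_scores


# Synthetic data used only to make the sketch runnable
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_cv, y_tr, y_cv = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

best_k, scores = pick_best_k(X_tr, y_tr, X_cv, y_cv, [1, 5, 11, 21])
print("Best K by CV AUC:", best_k)
```

Picking the maximum is the simplest rule. When the CV curve is flat, a smaller K with nearly the same AUC may be preferable because it is cheaper to evaluate.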



In [None]:
    #   BOW_data_X_train
# BOW_data_Y_train
#     BOW_data_X_test
# BOW_data_Y_test
# BOW_data_X_cross_val
# BOW_data_Y_cross_val
knn_optimal = KNeighborsClassifier(n_neighbors=55)
knn_optimal.fit(BOW_data_X_train, BOW_data_Y_train)

# predict() returns hard 0/1 labels, so each ROC curve below has a single operating point
pred_cross_val = knn_optimal.predict(BOW_data_X_cross_val)

# Code of ROC from: https://stackoverflow.com/questions/25009284/how-to-plot-roc-curve-in-python
fpr_cross_val, tpr_cross_val, threshold_cv = metrics.roc_curve(BOW_data_Y_cross_val, pred_cross_val)
roc_auc_BOW_cv = metrics.auc(fpr_cross_val, tpr_cross_val)
print("AUC Value Cross validation: ", str(roc_auc_BOW_cv))

pred_test = knn_optimal.predict(BOW_data_X_test)

# Code of ROC from: https://stackoverflow.com/questions/25009284/how-to-plot-roc-curve-in-python
fpr_test, tpr_test, threshold_test = metrics.roc_curve(BOW_data_Y_test, pred_test)

print("True Positive Rate Test:"+str(tpr_test[1]))
print("False Negative Rate Test:"+str(1 - tpr_test[1]))
print("False Positive Rate Test:"+str(fpr_test[1]))
print("True Negative Rate Test:"+str(1 - fpr_test[1]))


roc_auc_BOW_test = metrics.auc(fpr_test, tpr_test)
print("AUC Value on Test data: ", str(roc_auc_BOW_test))

import matplotlib.pyplot as plt
plt.title('ROC Curve on KNN brute force on BOW data')

# plt.plot(train_fpr, train_tpr, label="train AUC ="+str(auc(train_fpr, train_tpr)))
# plt.plot(test_fpr, test_tpr, label="train AUC ="+str(auc(test_fpr, test_tpr)))

plt.plot(fpr_test, tpr_test, label = 'AUC of test data='+str(roc_auc_BOW_test))
plt.plot(fpr_cross_val, tpr_cross_val, label = 'AUC of Crossval='+ str(roc_auc_BOW_cv))

plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

#Confusion matrix code from: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html
from sklearn.metrics import confusion_matrix
print("CONFUSION MATRIX: ")

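# confusion_matrix expects (y_true, y_pred); rows are actual classes, columns are predicted classes, ordered 0 then 1
df_cm = pd.DataFrame(confusion_matrix(BOW_data_Y_test, pred_test), index = ['Actual Negative', 'Actual Positive'],
                  columns = ['Predicted Negative', 'Predicted Positive'])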
plt.figure(figsize = (10,7))
sns.heatmap(df_cm, annot=True, fmt='g')
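
The ROC code above passes the output of `predict` to `metrics.roc_curve`. As a small illustration of how that differs from scoring with probabilities, the sketch below compares the two on made-up labels and scores. The arrays `y_true` and `scores` are hypothetical and are not results from this notebook.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

y_true = np.array([0, 0, 1, 1, 1, 1])
# Stand-in for predict_proba(...)[:, 1]
scores = np.array([0.4, 0.6, 0.55, 0.7, 0.8, 0.9])
# Stand-in for predict(): hard labels at a 0.5 threshold
hard = (scores >= 0.5).astype(int)

auc_from_scores = roc_auc_score(y_true, scores)
auc_from_labels = roc_auc_score(y_true, hard)

# With hard labels the curve has only the two corners and one operating point
fpr_hard, tpr_hard, _ = roc_curve(y_true, hard)

print("AUC from probabilities:", auc_from_scores)
print("AUC from hard labels:  ", auc_from_labels)
print("ROC points from hard labels:", len(fpr_hard))
```

With hard labels the area equals the average of the true positive rate and the true negative rate at that single threshold, so it discards the ranking information that the probabilities carry.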

### 2.4.2 Applying KNN brute force on TFIDF,<font color='red'> SET 2</font>

In [None]:
# https://stackoverflow.com/questions/9031783/hide-all-warnings-in-ipython
import warnings
warnings.filterwarnings('ignore')

#_________________________________________________________________________________________________________________
# Hyperparameter tuning: train and CV AUC for each K

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score



train_auc = []
cv_auc = []
K = [1, 5, 10, 21, 31, 41, 51, 101]
for i in tqdm(K):
  neigh = KNeighborsClassifier(n_neighbors=i)
  neigh.fit(TFIDF_data_X_train, TFIDF_data_Y_train)
  
  y_train_pred = neigh.predict_proba(TFIDF_data_X_train)[:,1]
  y_cv_pred =  neigh.predict_proba(TFIDF_data_X_cross_val)[:,1]

       
  train_auc.append(roc_auc_score(TFIDF_data_Y_train,y_train_pred))
  cv_auc.append(roc_auc_score(TFIDF_data_Y_cross_val, y_cv_pred))

plt.plot(K, train_auc, label='Train AUC')
plt.plot(K, cv_auc, label='CV AUC')

plt.scatter(K, train_auc, label='Train AUC points')
plt.scatter(K, cv_auc, label='CV AUC points')

plt.legend()
plt.xlabel("K: hyperparameter")
plt.ylabel("AUC")
plt.title("AUC vs K (KNN on TFIDF)")
plt.grid()
plt.show()



In [None]:
# TFIDF_data_X_train
# TFIDF_data_Y_train
# TFIDF_data_X_test
# TFIDF_data_Y_test
# TFIDF_data_X_cross_val
# TFIDF_data_Y_cross_val
knn_optimal = KNeighborsClassifier(n_neighbors=50)
knn_optimal.fit(TFIDF_data_X_train, TFIDF_data_Y_train)

pred_cross_val = knn_optimal.predict(TFIDF_data_X_cross_val)

# Code of ROC from: https://stackoverflow.com/questions/25009284/how-to-plot-roc-curve-in-python

fpr_cross_val, tpr_cross_val, threshold = metrics.roc_curve(TFIDF_data_Y_cross_val, pred_cross_val)
roc_auc_TFIDF_crossval = metrics.auc(fpr_cross_val, tpr_cross_val)
print("AUC Value Cross Validation: ", str(roc_auc_TFIDF_crossval))

pred_test = knn_optimal.predict(TFIDF_data_X_test)

# Code of ROC from: https://stackoverflow.com/questions/25009284/how-to-plot-roc-curve-in-python

fpr_test, tpr_test, threshold = metrics.roc_curve(TFIDF_data_Y_test, pred_test)
roc_auc_TFIDF_test = metrics.auc(fpr_test, tpr_test)
print("AUC Value of Test Data: ", str(roc_auc_TFIDF_test))


print("True Positive Rate Test:"+str(tpr_test[1]))
print("False Negative Rate Test:"+str(1 - tpr_test[1]))
print("False Positive Rate Test:"+str(fpr_test[1]))
print("True Negative Rate Test:"+str(1 - fpr_test[1]))

import matplotlib.pyplot as plt
plt.title('ROC Curve on KNN brute force on TFIDF data')
plt.plot(fpr_test, tpr_test, label = 'AUC test= '+str(roc_auc_TFIDF_test))
plt.plot(fpr_cross_val, tpr_cross_val, label = 'AUC crossval= '+str( roc_auc_TFIDF_crossval))

plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

#Confusion matrix code from: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html

from sklearn.metrics import confusion_matrix
print("CONFUSION MATRIX: ")

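# confusion_matrix expects (y_true, y_pred); rows are actual classes, columns are predicted classes, ordered 0 then 1
df_cm = pd.DataFrame(confusion_matrix(TFIDF_data_Y_test, pred_test), index = ['Actual Negative', 'Actual Positive'],
                  columns = ['Predicted Negative', 'Predicted Positive'])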
plt.figure(figsize = (10,7))
sns.heatmap(df_cm, annot=True,fmt='g')
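
The rates printed above can also be read directly from the confusion matrix. The sketch below shows the layout that `confusion_matrix` returns and how the true and false positive rates follow from it. The label arrays are made up for illustration, and it assumes class 1 (approved) is the positive class.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels, not notebook results
y_true = np.array([1, 1, 1, 0, 0, 1])
y_pred = np.array([1, 1, 0, 1, 0, 1])

# Argument order is (y_true, y_pred); labels=[0, 1] fixes the row/column order,
# so row 0 / column 0 is the negative class
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

tpr = tp / (tp + fn)   # share of actual positives predicted positive
fpr = fp / (fp + tn)   # share of actual negatives predicted positive

print("TPR:", tpr, "FPR:", fpr)
```

Swapping the two arguments transposes the matrix, which exchanges false positives and false negatives in any heatmap built from it.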

### 2.4.3 Applying KNN brute force on AVG W2V,<font color='red'> SET 3</font>

In [None]:
# https://stackoverflow.com/questions/9031783/hide-all-warnings-in-ipython
import warnings
warnings.filterwarnings('ignore')

#_________________________________________________________________________________________________________________
# Hyperparameter tuning: train and CV AUC for each K

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score


train_auc = []
cv_auc = []
K = [1, 5,15, 21, 31, 41, 51, 101]
for i in tqdm(K):
  neigh = KNeighborsClassifier(n_neighbors=i)
  neigh.fit(AVGW2V_data_X_train, AVGW2V_data_Y_train)
  
  y_train_pred = neigh.predict_proba(AVGW2V_data_X_train)[:,1]
  y_cv_pred =  neigh.predict_proba(AVGW2V_data_Xcross_val)[:,1]

       
  train_auc.append(roc_auc_score(AVGW2V_data_Y_train,y_train_pred))
  cv_auc.append(roc_auc_score(AVGW2V_data_Y_cross_val, y_cv_pred))

plt.plot(K, train_auc, label='Train AUC')
plt.plot(K, cv_auc, label='CV AUC')

plt.scatter(K, train_auc, label='Train AUC points')
plt.scatter(K, cv_auc, label='CV AUC points')

plt.legend()
plt.xlabel("K: hyperparameter")
plt.ylabel("AUC")
plt.title("AUC vs K (KNN on AVG W2V)")
plt.grid()
plt.show()

In [None]:
# AVGW2V_data_X_train
# AVGW2V_data_Y_train
# AVGW2V_data_X_test
# AVGW2V_data_Y_test
# AVGW2V_data_Xcross_val
# AVGW2V_data_Y_cross_val
knn_optimal = KNeighborsClassifier(n_neighbors=50)
knn_optimal.fit(AVGW2V_data_X_train, AVGW2V_data_Y_train)

pred_cross_val = knn_optimal.predict(AVGW2V_data_Xcross_val)

# Code of ROC from: https://stackoverflow.com/questions/25009284/how-to-plot-roc-curve-in-python

fpr_cross_val, tpr_cross_val, threshold = metrics.roc_curve(AVGW2V_data_Y_cross_val, pred_cross_val)
roc_auc_AVGW2V_crossval = metrics.auc(fpr_cross_val, tpr_cross_val)
print("AUC Value Cross Validation: ", str(roc_auc_AVGW2V_crossval))

pred_test = knn_optimal.predict(AVGW2V_data_X_test)

# Code of ROC from: https://stackoverflow.com/questions/25009284/how-to-plot-roc-curve-in-python

fpr_test, tpr_test, threshold = metrics.roc_curve(AVGW2V_data_Y_test, pred_test)
roc_auc_AVGW2V_test = metrics.auc(fpr_test, tpr_test)
print("AUC Value of Test Data: ", str(roc_auc_AVGW2V_test))

print("True Positive Rate Test:"+str(tpr_test[1]))
print("False Negative Rate Test:"+str(1 - tpr_test[1]))
print("False Positive Rate Test:"+str(fpr_test[1]))
print("True Negative Rate Test:"+str(1 - fpr_test[1]))


import matplotlib.pyplot as plt
plt.title('ROC Curve on brute force on AVG W2V data')
plt.plot(fpr_test, tpr_test, label = 'AUC test='+str(roc_auc_AVGW2V_test))
plt.plot(fpr_cross_val, tpr_cross_val, label = 'AUC cross val='+str(roc_auc_AVGW2V_crossval))
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

#Confusion matrix code from: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html

from sklearn.metrics import confusion_matrix
print("CONFUSION MATRIX: ")

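# confusion_matrix expects (y_true, y_pred); rows are actual classes, columns are predicted classes, ordered 0 then 1
df_cm = pd.DataFrame(confusion_matrix(AVGW2V_data_Y_test, pred_test), index = ['Actual Negative', 'Actual Positive'],
                  columns = ['Predicted Negative', 'Predicted Positive'])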
plt.figure(figsize = (10,7))
sns.heatmap(df_cm, annot=True, fmt='g')

### 2.4.4 Applying KNN brute force on TFIDF W2V,<font color='red'> SET 4</font>

In [None]:
# Please write all the code with proper documentation
# https://stackoverflow.com/questions/9031783/hide-all-warnings-in-ipython
import warnings
warnings.filterwarnings('ignore')


#_________________________________________________________________________________________________________________
# Hyperparameter tuning: train and CV AUC for each K
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score


train_auc = []
cv_auc = []
K = [1, 5, 15, 21, 31, 41, 51, 101]
for i in tqdm(K):
  neigh = KNeighborsClassifier(n_neighbors=i)
  neigh.fit(TFIDF_w2v_data_X_train, TFIDF_w2v_data_Y_train)
  
  y_train_pred = neigh.predict_proba(TFIDF_w2v_data_X_train)[:,1]
  y_cv_pred =  neigh.predict_proba(TFIDF_w2v_data_X_cross_val)[:,1]

       
  train_auc.append(roc_auc_score(TFIDF_w2v_data_Y_train,y_train_pred))
  cv_auc.append(roc_auc_score(TFIDF_w2v_data_Y_cross_val, y_cv_pred))

plt.plot(K, train_auc, label='Train AUC')
plt.plot(K, cv_auc, label='CV AUC')

plt.scatter(K, train_auc, label='Train AUC points')
plt.scatter(K, cv_auc, label='CV AUC points')

plt.legend()
plt.xlabel("K: hyperparameter")
plt.ylabel("AUC")
plt.title("AUC vs K (KNN on TFIDF W2V)")
plt.grid()
plt.show()



In [None]:
# TFIDF_w2v_data_X_train
# TFIDF_w2v_data_Y_train
# TFIDF_w2v_data_X_test
# TFIDF_w2v_data_Y_test
# TFIDF_w2v_data_X_cross_val
# TFIDF_w2v_data_Y_cross_val

knn_optimal = KNeighborsClassifier(n_neighbors=40)
knn_optimal.fit(TFIDF_w2v_data_X_train, TFIDF_w2v_data_Y_train)

pred_cross_val = knn_optimal.predict(TFIDF_w2v_data_X_cross_val)

# Code of ROC from: https://stackoverflow.com/questions/25009284/how-to-plot-roc-curve-in-python

fpr_cross_val, tpr_cross_val, threshold = metrics.roc_curve(TFIDF_w2v_data_Y_cross_val, pred_cross_val)
roc_auc_TFIDFW2V_crossval = metrics.auc(fpr_cross_val, tpr_cross_val)
print("AUC Value Cross Validation: ", str(roc_auc_TFIDFW2V_crossval))

pred_test = knn_optimal.predict(TFIDF_w2v_data_X_test)

# Code of ROC from: https://stackoverflow.com/questions/25009284/how-to-plot-roc-curve-in-python

fpr_test, tpr_test, threshold = metrics.roc_curve(TFIDF_w2v_data_Y_test, pred_test)
roc_auc_TFIDFW2V_test = metrics.auc(fpr_test, tpr_test)
print("AUC Value of Test Data: ", str(roc_auc_TFIDFW2V_test))


print("True Positive Rate Test:"+str(tpr_test[1]))
print("False Negative Rate Test:"+str(1 - tpr_test[1]))
print("False Positive Rate Test:"+str(fpr_test[1]))
print("True Negative Rate Test:"+str(1 - fpr_test[1]))


import matplotlib.pyplot as plt
plt.title('ROC Curve on KNN brute force on TFIDF W2V data')
plt.plot(fpr_cross_val, tpr_cross_val, label = 'AUC Cross val=' +str(roc_auc_TFIDFW2V_crossval))
plt.plot(fpr_test, tpr_test, label = 'AUC test='+str(roc_auc_TFIDFW2V_test))
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

#Confusion matrix code from: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html

from sklearn.metrics import confusion_matrix
print("CONFUSION MATRIX: ")

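# confusion_matrix expects (y_true, y_pred); rows are actual classes, columns are predicted classes, ordered 0 then 1
df_cm = pd.DataFrame(confusion_matrix(TFIDF_w2v_data_Y_test, pred_test), index = ['Actual Negative', 'Actual Positive'],
                  columns = ['Predicted Negative', 'Predicted Positive'])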
plt.figure(figsize = (10,7))
sns.heatmap(df_cm, annot=True, fmt='g')

<h2>2.5 Feature selection with `SelectKBest` </h2>

In [0]:
# please write all the code with proper documentation, and proper titles for each subsection
# go through documentations and blogs before you start coding
# first figure out what to do, and then think about how to do.
# reading and understanding error messages will be very helpful in debugging your code

# when you plot any graph make sure you use 
    # a. Title, that describes your plot, this will be very helpful to the reader
    # b. Legends if needed
    # c. X-axis label
    # d. Y-axis label

# TFIDF_data_X_train
# TFIDF_data_Y_train
# TFIDF_data_X_test
# TFIDF_data_Y_test
# TFIDF_data_X_cross_val
# TFIDF_data_Y_cross_val

# Select the 2000 best features of the TFIDF data (ANOVA F-test), fitted on the training split only
from sklearn.feature_selection import SelectKBest, f_classif

kbest = SelectKBest(f_classif, k=2000).fit(TFIDF_data_X_train, TFIDF_data_Y_train)
X_train_best = kbest.transform(TFIDF_data_X_train)
X_test_best = kbest.transform(TFIDF_data_X_test)
X_cross_val_best = kbest.transform(TFIDF_data_X_cross_val)
# print(X_cross_val_best)

# https://stackoverflow.com/questions/9031783/hide-all-warnings-in-ipython
import warnings
warnings.filterwarnings('ignore')

#_________________________________________________________________________________________________________________

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score


k_vals = list(range(1, 30))

train_auc = []
cv_auc = []

for i in tqdm(k_vals):
  knn = KNeighborsClassifier(n_neighbors=i)
  knn.fit(X_train_best, TFIDF_data_Y_train)

  # tune K on the train and CV splits only; the test split is kept for the final evaluation
  y_train_pred =  knn.predict_proba(X_train_best)[:,1]
  y_cv_pred =  knn.predict_proba(X_cross_val_best)[:,1]

  train_auc.append(roc_auc_score(TFIDF_data_Y_train,y_train_pred))
  cv_auc.append(roc_auc_score(TFIDF_data_Y_cross_val, y_cv_pred))


plt.plot(k_vals, train_auc, label='Train AUC')
plt.plot(k_vals, cv_auc, label='CV AUC')
plt.legend()
plt.xlabel("K: hyperparameter")
plt.ylabel("AUC")
plt.title("AUC vs K (KNN on SelectKBest TFIDF features)")
plt.show()
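
The selection step above fits `SelectKBest` on the training split and reuses it for the other splits. A minimal sketch of that pattern on a small sparse matrix is shown below. The synthetic arrays and `k=2` are assumptions chosen for illustration, while the notebook itself keeps 2000 features.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.RandomState(0)

# Synthetic data: four noise columns plus one column that tracks the label
y_train = np.array([0, 1] * 20)
informative = y_train.reshape(-1, 1) + 0.1 * rng.rand(40, 1)
noise = rng.rand(40, 4)
X_train_sparse = csr_matrix(np.hstack([noise, informative]))
X_cv_sparse = csr_matrix(rng.rand(10, 5))

# Fit the selector on the training split only, then reuse it unchanged
selector = SelectKBest(f_classif, k=2).fit(X_train_sparse, y_train)
X_train_sel = selector.transform(X_train_sparse)
X_cv_sel = selector.transform(X_cv_sparse)

# Column indices that survived the selection
kept = selector.get_support(indices=True)
print(kept, X_train_sel.shape, X_cv_sel.shape)
```

`get_support(indices=True)` makes it possible to map the surviving columns back to feature names, which helps when checking whether the text or the categorical features dominate the selection.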



In [0]:
knn_optimal = KNeighborsClassifier(n_neighbors=6)
knn_optimal.fit(X_train_best, TFIDF_data_Y_train)

pred_cross_val = knn_optimal.predict(X_cross_val_best)

# Code of ROC from: https://stackoverflow.com/questions/25009284/how-to-plot-roc-curve-in-python

fpr_crossval, tpr_crossval, threshold = metrics.roc_curve(TFIDF_data_Y_cross_val, pred_cross_val)
# separate variable names, so the SET 2 results used in the conclusion table are not overwritten
roc_auc_TFIDF_kbest_crossval = metrics.auc(fpr_crossval, tpr_crossval)
print("AUC Value Cross Validation: ", str(roc_auc_TFIDF_kbest_crossval))

pred_test = knn_optimal.predict(X_test_best)

# Code of ROC from: https://stackoverflow.com/questions/25009284/how-to-plot-roc-curve-in-python

fpr_test, tpr_test, threshold = metrics.roc_curve(TFIDF_data_Y_test, pred_test)
roc_auc_TFIDF_kbest_test = metrics.auc(fpr_test, tpr_test)
print("AUC Value of Test Data: ", str(roc_auc_TFIDF_kbest_test))

import matplotlib.pyplot as plt
plt.title('ROC Curve on KNN brute force on TFIDF data with SelectKBest features')
plt.plot(fpr_test, tpr_test, label = 'AUC test='+ str(roc_auc_TFIDF_kbest_test))
plt.plot(fpr_crossval, tpr_crossval, label = 'AUC Crossval='+str(roc_auc_TFIDF_kbest_crossval))
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

#Confusion matrix code from: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html
from sklearn.metrics import confusion_matrix
print("CONFUSION MATRIX: ")


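# confusion_matrix expects (y_true, y_pred); rows are actual classes, columns are predicted classes, ordered 0 then 1
df_cm = pd.DataFrame(confusion_matrix(TFIDF_data_Y_test, pred_test), index = ['Actual Negative', 'Actual Positive'],
                  columns = ['Predicted Negative', 'Predicted Positive'])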
plt.figure(figsize = (10,7))
sns.heatmap(df_cm, annot=True, fmt='g')

<h1>3. Conclusions</h1>

In [0]:
# Please compare all your models using Prettytable library

from prettytable import PrettyTable
x = PrettyTable()

x.field_names = ["Vectorizer", "Model", "Hyper Parameter", "AUC"]

x.add_row(["BOW", 'Brute', '55', roc_auc_BOW_test])
x.add_row(["TFIDF", 'Brute', '50', roc_auc_TFIDF_test])
x.add_row(["AVG W2V", 'Brute', '50', roc_auc_AVGW2V_test])
x.add_row(["TFIDF W2V", 'Brute', '40', roc_auc_TFIDFW2V_test])
print(x)

The conclusion table shows an AUC of approximately 0.50 for every featurization, which is the same as the AUC of a random model. These values are computed from hard class predictions (`predict`), so each one reflects a single operating point rather than the full ranking of predicted probabilities.
The results are consistent with an imbalanced dataset, and they show that accuracy is not a good metric here: the models reach high accuracy but low AUC, which suggests they behave like a naive model that mostly predicts the majority class.

The AUC plots for the different K values show that increasing K improves the AUC, but it takes a lot of computation and running time, because brute-force KNN compares every query point with every training point at prediction time.
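
To illustrate why high accuracy can coexist with an AUC of 0.5, the sketch below scores a baseline that always predicts the majority class. The 85/15 class split and the all-zero feature matrix are assumptions made for illustration and are not taken from the notebook's data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, roc_auc_score

# Assumed imbalanced labels: 85 approved (1) and 15 rejected (0)
y = np.array([1] * 85 + [0] * 15)
X = np.zeros((100, 1))   # features are ignored by the baseline

# Always predicts the most frequent class
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)

acc = accuracy_score(y, baseline.predict(X))
auc = roc_auc_score(y, baseline.predict_proba(X)[:, 1])

print("Accuracy:", acc)
print("AUC:", auc)
```

The baseline is right whenever the true class is the majority class, so its accuracy equals the majority share, while its constant score cannot rank any positive above any negative and the AUC stays at 0.5.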