
<a href="http://www.inokufu.com"><img src = "http://www.inokufu.com/wp-content/uploads/elementor/thumbs/logo_inokufu_vector_full-black-om2hmu9ob1jytetxemkj1ij8g7tt3hzrtssivh2fl2.png" width = 400> </a>


<h1 align=center><font size = 5>Exploratory Data Analysis : Titre</font></h1>

## Introduction

In this notebook, we conduct an Exploratory Data Analysis (EDA). The goal is to find out which kind of classifier performs best on the data we processed in part 1, using the Word2Vec models we trained in part 2.

Our EDA approach follows the **Data Science Methodology CRISP-DM**. For more info about this approach, check this [Wikipedia page](https://en.wikipedia.org/wiki/Cross-industry_standard_process_for_data_mining).

## Table of Contents

<div class="alert alert-block alert-info" style="margin-top: 20px">

<font size = 3>

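1. <a href="#item1">Data Import</a>

2. <a href="#item2">Models Import</a>

3. <a href="#item3">Data Transformation</a>

4. <a href="#item4">Classification</a>

5. <a href="#item5">Normalisation + PCA</a>

6. <a href="#item6">Next Steps</a>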

</font>
</div>
<a id='the_destination'></a>

In [None]:
import numpy as np 
np.set_printoptions(threshold=10000,suppress=True) 
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import matplotlib as mpl
import matplotlib.cm as cm
import matplotlib.colors as colors
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import matplotlib.image as img
from matplotlib import rcParams

import json
import unicodedata

import seaborn as sns
from cycler import cycler

from bs4 import BeautifulSoup

import spacy
import nltk
import os
import string
import copy
import pickle
import re
import math
import gensim
import time 
import multiprocess
import multiprocessing

import fr_core_news_sm

from spacy_langdetect import LanguageDetector

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem import SnowballStemmer

from collections import Counter

from gensim.models import Word2Vec
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score, recall_score, precision_score, roc_auc_score
from sklearn.feature_extraction.text import CountVectorizer,TfidfTransformer
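# The sklearn.feature_extraction.stop_words module is no longer available in recent scikit-learn releases
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS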
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score,cross_val_predict, KFold
from sklearn.metrics import cohen_kappa_score

from textblob import Blobber
from textblob_fr import PatternTagger, PatternAnalyzer

print('Libraries imported.')

## 1. Data Import <a id='item1'></a>

In [None]:
path = './data/20200330_Processed_Data/'
file_name = '20200330_Processed_Udemy_Json'
file_extension = '.csv'

df = pd.read_csv(path + file_name + file_extension, sep=';')
df = df.drop('Unnamed: 0', axis=1)
df_json = df.reset_index(drop=True)
df_json.head(3)

In [None]:
# Splitting the data into text features (description, title) and the binary target (rating_01)

corpus = df_json['description']
title = df_json['title']
Y = df_json['rating_01'].astype(int)

In [None]:
corpus = corpus.apply(lambda line: gensim.utils.simple_preprocess(str(line)))
title = title.apply(lambda line: gensim.utils.simple_preprocess(str(line)))

## 2. Models Import <a id='item2'></a>

In [None]:
# Loading the 4 Word2Vec models we trained in part 2

model_100_5 = gensim.models.Word2Vec.load("./models/20200330_UdemyDesc_Desc_Obj_100_5000_Train25_Iter10.model")
model_100_10 = gensim.models.Word2Vec.load("./models/20200330_UdemyDesc_Desc_Obj_100_10000_Train25_Iter10.model")
model_300_5 = gensim.models.Word2Vec.load("./models/20200330_UdemyDesc_Desc_Obj_300_5000_Train25_Iter10.model")
model_300_10 = gensim.models.Word2Vec.load("./models/20200330_UdemyDesc_Desc_Obj_300_10000_Train25_Iter10.model")

model_to_send = gensim.models.Word2Vec.load("./models/Word2Vec_model_to_send.model")

## 3. Data Transformation <a id='item3'></a>

In [None]:
# Build one Word2Vec vector per document by combining its word embeddings.
# type_ selects how the word vectors are combined: 'average' or 'sum'.

def word2vec_Embedding(reviews_unigram, model_, model_size_, type_):
    
    # Reject unknown embedding types up front
    if type_ not in ('average', 'sum'):
        raise ValueError("type_ must be 'average' or 'sum'")
    
    dict_word2vec = {}
    for index, word_list in enumerate(reviews_unigram):
        arr = np.zeros(model_size_)
        for word in word_list:
            try:
                arr += model_.wv[word]
            except KeyError:
                # Word is not in the model vocabulary: skip it
                continue
                
        if type_ == 'average' and len(word_list) > 0:
            # Mean over all tokens; out-of-vocabulary tokens still count in the denominator
            dict_word2vec[index] = arr / len(word_list)
        else:
            # Sum of the vectors (or a zero vector for an empty document)
            dict_word2vec[index] = arr
        
    df_word2vec = pd.DataFrame(dict_word2vec).T
    return df_word2vec
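
The average above divides by the total number of tokens, so words missing from the vocabulary pull the document vector toward zero. A minimal sketch of one alternative, dividing only by the number of tokens actually found, could look like this. The helper name `average_known_vectors` and the toy `toy_vectors` dictionary are hypothetical; a plain dict stands in for the trained model's word vectors.

```python
import numpy as np

def average_known_vectors(tokens, vectors, size):
    """Average the embeddings of the tokens present in `vectors`.

    `vectors` is any mapping from word to 1-D array (a dict here,
    standing in for a trained model's word vectors).
    """
    total = np.zeros(size)
    found = 0
    for token in tokens:
        if token in vectors:
            total += vectors[token]
            found += 1
    # Empty or fully out-of-vocabulary documents map to the zero vector
    return total / found if found else total

# Toy 2-dimensional vocabulary, for illustration only
toy_vectors = {"python": np.array([1.0, 0.0]), "course": np.array([0.0, 1.0])}
doc_vector = average_known_vectors(["python", "course", "unknownword"], toy_vectors, 2)
print(doc_vector)
```

Whether this variant helps the classifier is an open question; it would need to be compared with the original averaging on the same folds.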

In [None]:
# Converting each document to a vector by averaging its Word2Vec word embeddings

# Trying this model first:
model = model_300_5
model_size = 300

df_word2vec = word2vec_Embedding(corpus, model, model_size, 'average')

In [None]:
df_word2vec.head(3)

## 4. Classification <a id='item4'></a>

In [None]:
# Instantiating the classifiers
clfs = {
    'RF': RandomForestClassifier(n_estimators = 500, random_state = 0),
    #'ADA': AdaBoostClassifier(n_estimators=50),
    #'BAG': BaggingClassifier(n_estimators=50),
    #'KNN': KNeighborsClassifier(n_neighbors=5),
    #'NB': GaussianNB(),
    #'MLP': MLPClassifier(hidden_layer_sizes=(20, 10), alpha=0.001, max_iter=200),
    #'CART': DecisionTreeClassifier(criterion='gini'),
    #'ID3': DecisionTreeClassifier(criterion='entropy'),
    #'ST': DecisionTreeClassifier(max_depth=1)
}

In [None]:
def run_classifiers(X, Y, clfs):
    # Evaluate each classifier with 10-fold cross-validation and print its metrics
    
    kf = KFold(n_splits=10, shuffle=True, random_state=0)
    
    for clf_name in clfs:
        clf = clfs[clf_name]
        
        begin = time.time()
        cv_acc = cross_val_score(clf, X, Y, cv=kf)
        end = time.time()
        print("Accuracy for {0} is: {1:.3f} +/- {2:.3f} (in {3:.2f} seconds)".format(clf_name, np.mean(cv_acc), np.std(cv_acc), end-begin))
        
        cv_precision = cross_val_score(clf, X, Y, cv=kf, scoring='precision') 
        print("Precision for {0} is: {1:.3f} +/- {2:.3f}".format(clf_name, np.mean(cv_precision), np.std(cv_precision)))
        
        cv_recall = cross_val_score(clf, X, Y, cv=kf, scoring='recall')
        print("Recall for {0} is: {1:.3f} +/- {2:.3f}".format(clf_name, np.mean(cv_recall), np.std(cv_recall)))
        
        cv_auc = cross_val_score(clf, X, Y, cv=kf, scoring='roc_auc') 
        print("AUC for {0} is: {1:.3f} +/- {2:.3f}".format(clf_name, np.mean(cv_auc), np.std(cv_auc)))
        
        Y_pred = cross_val_predict(clf, X, Y, cv=kf)
        
        cohen_kappa = cohen_kappa_score(Y,Y_pred)
        print("Cohen-Kappa for {0} is: {1:.3f}".format(clf_name, cohen_kappa))
        
        conf_mat = confusion_matrix(Y, Y_pred)
        print(conf_mat)
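
The function above refits each classifier once per metric. As a hedged sketch of a lighter alternative, scikit-learn's `cross_validate` can compute several metrics from a single fit per fold. The synthetic data (`X_demo`, `y_demo`), the smaller forest, and the 5 folds are assumptions chosen only to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_validate

# Synthetic binary data, standing in for the embedding features
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
clf = RandomForestClassifier(n_estimators=20, random_state=0)
metrics = ["accuracy", "precision", "recall", "roc_auc"]

# One call scores every metric on the same fitted model in each fold
scores = cross_validate(clf, X_demo, y_demo, cv=kf, scoring=metrics)

for metric in metrics:
    values = scores["test_" + metric]
    print("{0}: {1:.3f} +/- {2:.3f}".format(metric, values.mean(), values.std()))
```

The Cohen's kappa score and the confusion matrix would still need the out-of-fold predictions from `cross_val_predict`.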

In [None]:
run_classifiers(df_word2vec, Y, clfs)

### Add variables : price_detail and num_sub 

In [None]:
# Concatenating df_word2vec with the new variables

df_final = pd.concat([df_word2vec,df_json['price_detail'],df_json['num_sub']],axis=1)

In [None]:
df_final.head(3)

### Run classifiers

In [None]:
run_classifiers(df_final, Y, clfs)

## 5. Normalisation + PCA <a id='item5'></a>

In [None]:
X = df_final

### Normalisation

In [None]:
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
sc.fit(X)
X_norm = sc.transform(X)
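
Fitting `StandardScaler` on the whole dataset before cross-validation means the scaling statistics also come from rows that are later used for testing. One possible way to avoid this, sketched here on synthetic data, is to put the scaler in a pipeline so that it is refit on the training part of each fold. The names `X_demo`, `y_demo` and `pipe` are illustrative, not part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary data, standing in for df_final
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

# The scaler is fit only on the training rows of each fold
pipe = make_pipeline(StandardScaler(),
                     RandomForestClassifier(n_estimators=20, random_state=0))

kf = KFold(n_splits=5, shuffle=True, random_state=0)
pipe_scores = cross_val_score(pipe, X_demo, y_demo, cv=kf)
print("Accuracy: {0:.3f} +/- {1:.3f}".format(pipe_scores.mean(), pipe_scores.std()))
```

Tree-based models are largely insensitive to feature scaling, so the effect is likely to matter more for the PCA step and for classifiers such as KNN or MLP.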

### Adding a PCA

In [None]:
# Test: the goal is to find how many components to keep for the PCA to be effective
from sklearn.decomposition import PCA

pca = PCA()
pca.fit(X_norm)

n = len(X_norm)
p = len(df_final.columns)

eigval = (n-1)/n * pca.explained_variance_

plt.plot(np.arange(1,p+1),eigval) 
plt.title("Scree plot") 
plt.ylabel("Eigen values") 
plt.xlabel("Factor number") 
plt.show()

In [None]:
pca = PCA()
pca.fit(X_norm)
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance');
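
Instead of reading the number of components off the curve, it can be computed from the cumulative explained variance. The sketch below is a minimal illustration: the helper name `components_for_variance`, the 0.8 threshold, and the `ratios` array are assumptions standing in for `pca.explained_variance_ratio_`.

```python
import numpy as np

def components_for_variance(explained_variance_ratio, threshold=0.95):
    """Smallest number of components whose cumulative ratio reaches `threshold`."""
    cumulative = np.cumsum(explained_variance_ratio)
    # Index of the first cumulative value >= threshold, converted to a count
    count = int(np.searchsorted(cumulative, threshold) + 1)
    # Never ask for more components than are available
    return min(count, len(cumulative))

# Made-up ratios, for illustration only
ratios = np.array([0.5, 0.25, 0.125, 0.0625, 0.0625])
k = components_for_variance(ratios, threshold=0.8)
print(k)
```

With the fitted object from the cell above, the call would be `components_for_variance(pca.explained_variance_ratio_)`.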

In [None]:
pca = PCA(n_components = 6)
pca.fit(X_norm)

X_pca = np.concatenate((X_norm, pca.transform(X_norm)), axis=1) 

In [None]:
run_classifiers(X_pca, Y, clfs)

## 6. Next Steps <a id='item6'></a>

The results so far are encouraging: we currently have a model with 67% accuracy and 69% precision, compared with 50% accuracy for a naive classifier.
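
The 50% figure for a naive classifier holds only if the two rating classes are balanced. A minimal sketch for checking the majority-class baseline is given here; the helper name `majority_baseline_accuracy` and the two toy label lists are hypothetical.

```python
import numpy as np

def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent class."""
    labels = np.asarray(labels)
    _, counts = np.unique(labels, return_counts=True)
    return counts.max() / counts.sum()

# Toy label vectors, for illustration only
balanced = [0, 1, 0, 1, 0, 1, 0, 1]
skewed = [1, 1, 1, 0]
print(majority_baseline_accuracy(balanced))
print(majority_baseline_accuracy(skewed))
```

Calling `majority_baseline_accuracy(Y)` on the notebook's target would give the baseline to compare against.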

For next steps, we could :
- try other word2vec models with different parameters
- try changing the number of PCA components we keep (`n_components` in section 5)
- tune the classifiers' hyperparameters: we tried many classifiers, but without optimising their parameters
- try BERT or an LSTM
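
For the hyperparameter step listed above, one option is an exhaustive grid search with scikit-learn's `GridSearchCV`. This is only a sketch under stated assumptions: the synthetic data, the small `param_grid`, and the 3 folds are chosen to keep the run short and are not tuned for the Udemy data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic binary data, standing in for the embedding features
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

# Illustrative grid; real ranges would depend on the dataset
param_grid = {"n_estimators": [10, 30], "max_depth": [3, None]}

search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid,
                      cv=KFold(n_splits=3, shuffle=True, random_state=0),
                      scoring="accuracy")
search.fit(X_demo, y_demo)

print(search.best_params_)
print("Best cross-validated accuracy: {0:.3f}".format(search.best_score_))
```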


<hr>

Author [Guillaume Lefebvre](https://www.linkedin.com/in/guillaume-lefebvre-22117610b/) - For more information, contact us at contact@inokufu.com - Copyright &copy; 2020 [Inokufu](http://www.inokufu.com)



