# PREPROCESSING 

This notebook merges the source tables and applies a first cleaning pass to the data.

---
## IMPORT 

In [3]:
import os
import pandas as pd 
import numpy as np 
import pickle
from sklearn.model_selection import train_test_split

## Get Data 

In [4]:
df_answers = pd.read_csv('../data/Answers.csv', encoding='latin-1')
df_questions = pd.read_csv('../data/Questions.csv', encoding='latin-1')

In [5]:
df_train = df_answers.merge(df_questions, how='inner', left_on='ParentId', right_on='Id')
df_train = df_train.drop(['ParentId'], axis=1)
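
As a rough illustration of what the inner merge above does, here is a plain-Python sketch on made-up rows (only the column names come from the CSV files; the values and the `inner_join` helper are hypothetical). It mimics two visible effects of the merge: answers whose `ParentId` has no matching question are dropped, and column names shared by both tables receive the `_x` / `_y` suffixes seen in the output below.

```python
# Toy rows: column names mirror Answers.csv / Questions.csv, values are made up.
answers = [
    {"Id": 10, "ParentId": 1, "Score": 4},
    {"Id": 11, "ParentId": 1, "Score": 2},
    {"Id": 12, "ParentId": 99, "Score": 7},  # no matching question
]
questions = [
    {"Id": 1, "Score": 21, "Title": "First question"},
    {"Id": 2, "Score": 5, "Title": "Unanswered question"},
]


def inner_join(left, right, left_on, right_on):
    """Inner join two lists of dicts, suffixing shared column names with _x / _y."""
    shared = set(left[0]) & set(right[0])
    # Index the right-hand rows by their join key.
    by_key = {}
    for row in right:
        by_key.setdefault(row[right_on], []).append(row)
    joined = []
    for l_row in left:
        # Left rows without a match produce nothing (inner join).
        for r_row in by_key.get(l_row[left_on], []):
            out = {(k + "_x" if k in shared else k): v for k, v in l_row.items()}
            out.update({(k + "_y" if k in shared else k): v for k, v in r_row.items()})
            joined.append(out)
    return joined


merged = inner_join(answers, questions, left_on="ParentId", right_on="Id")
for row in merged:
    print(row)
```

Only the two answers attached to question 1 survive, each carrying both the answer columns (`Id_x`, `Score_x`) and the question columns (`Id_y`, `Score_y`, `Title`).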

In [6]:
df_train.head()

Unnamed: 0,Id_x,OwnerUserId_x,CreationDate_x,Score_x,Body_x,Id_y,OwnerUserId_y,CreationDate_y,Score_y,Title,Body_y
0,497,50.0,2008-08-02T16:56:53Z,4,<p>open up a terminal (Applications-&gt;Utilit...,469,147.0,2008-08-02T15:11:16Z,21,How can I find the full path to a font from it...,<p>I am using the Photoshop's javascript API t...
1,518,153.0,2008-08-02T17:42:28Z,2,<p>I haven't been able to find anything that d...,469,147.0,2008-08-02T15:11:16Z,21,How can I find the full path to a font from it...,<p>I am using the Photoshop's javascript API t...
2,3040,457.0,2008-08-06T03:01:23Z,12,<p>Unfortunately the only API that isn't depre...,469,147.0,2008-08-02T15:11:16Z,21,How can I find the full path to a font from it...,<p>I am using the Photoshop's javascript API t...
3,195170,745.0,2008-10-12T07:02:40Z,1,<p>There must be a method in Cocoa to get a li...,469,147.0,2008-08-02T15:11:16Z,21,How can I find the full path to a font from it...,<p>I am using the Photoshop's javascript API t...
4,536,161.0,2008-08-02T18:49:07Z,9,<p>You can use ImageMagick's convert utility f...,502,147.0,2008-08-02T17:01:58Z,27,Get a preview JPEG of a PDF on Windows?,<p>I have a cross-platform (Python) applicatio...


In [7]:
# Keep the answer id, body and score, plus the question title and body
df_train = df_train[['Id_x', 'Body_x', 'Score_x', 'Title', 'Body_y']]
df_train.columns = ['id', 'answer', 'target', 'question', 'question_body']
df_train = df_train[['id', 'question', 'question_body', 'answer', 'target']]
# Text columns that will go through the cleaning step
str_cols = ['answer', 'question', 'question_body']

## Split data train/validation/test
To keep training runs manageable, we work on the first 50,000 rows of data, split as follows: 60% training, 28% validation, 12% test.

In [None]:
df_train = df_train.iloc[:50000]

In [13]:
X_train, X_val, y_train, y_val = train_test_split(df_train[['id', 'question', 'question_body', 'answer']], df_train['target'], test_size=0.4, random_state=0)

In [14]:
X_val, X_test, y_val, y_test = train_test_split(X_val[['id', 'question', 'question_body', 'answer']], y_val, test_size=0.3, random_state=0)

In [15]:
print('X_train shape :', X_train.shape)
print('X_val shape :', X_val.shape)
print('X_test shape :', X_test.shape)

X_train shape : (30000, 4)
X_val shape : (14000, 4)
X_test shape : (6000, 4)
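
The shapes above follow from chaining two splits: 40% of the rows are held out first, then 30% of that held-out part becomes the test set. The sketch below reproduces the arithmetic with exact fractions; the `two_stage_split` helper is hypothetical and only computes sizes, it does not shuffle or split any data, and scikit-learn's own rounding could differ by a row on inputs that do not divide evenly.

```python
from fractions import Fraction


def two_stage_split(n_rows, first_holdout, second_holdout):
    """Return (n_train, n_val, n_test) for two chained hold-out splits."""
    n_holdout = int(n_rows * first_holdout)   # rows set aside by the first split
    n_train = n_rows - n_holdout
    n_test = int(n_holdout * second_holdout)  # share of the hold-out used for test
    n_val = n_holdout - n_test
    return n_train, n_val, n_test


n_train, n_val, n_test = two_stage_split(50_000, Fraction(2, 5), Fraction(3, 10))
print(n_train, n_val, n_test)
print([float(Fraction(n, 50_000)) for n in (n_train, n_val, n_test)])
```

With 50,000 rows this gives 30,000 / 14,000 / 6,000 rows, that is 60% / 28% / 12%.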


In [16]:
X_train.head()

Unnamed: 0,id,question,question_body,answer
11331,431460,Restarting a Python Interpreter Quietly,<p>I have a python interpreter embedded inside...,<p>One very hacky and bug prone approach might...
43118,1553478,Most efficient way to add new keys or append t...,<p>Here's a common situation when compiling da...,"<p>Use <a href=""http://docs.python.org/3.1/lib..."
13881,514054,How to split a string by using [] in Python,"<p>So from this string:</p>\n\n<p>""name[id]""</...","<p>Yes, the delimiter is the whole string argu..."
23536,855511,referenced before assignment error in python,<p>In Python I'm getting the following error:<...,<p>I think you are using 'global' incorrectly....
33095,1187029,How do I control number formatting in the pyth...,<p>I often use the python interpreter for doin...,"<p>Hm... It's not a 100% solution, but this ha..."


## Launch preproc

The cleaning consists of:
- stripping HTML tags from the text
- removing characters other than letters (punctuation)
- removing stopwords
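
The three steps listed above can be sketched as follows. This is not the `brief_cleaning` function imported below, whose source is not shown here; it is a minimal standard-library approximation, and both the `clean_text_sketch` name and the tiny stopword set are assumptions made for illustration (the log output suggests the real function relies on NLTK's stopword list).

```python
import html
import re

# Tiny illustrative stopword set (assumption); the real list is much longer.
STOPWORDS = {"a", "an", "the", "to", "is", "of", "in", "and", "have"}


def clean_text_sketch(text):
    """Strip HTML tags, keep letters only, then drop stopwords."""
    no_tags = re.sub(r"<[^>]+>", " ", text)               # 1. remove HTML tags
    unescaped = html.unescape(no_tags)                    # decode entities such as &gt;
    letters_only = re.sub(r"[^A-Za-z]+", " ", unescaped)  # 2. keep letters only
    tokens = [t for t in letters_only.split() if t.lower() not in STOPWORDS]  # 3. stopwords
    return " ".join(tokens)


print(clean_text_sketch("<p>How to split a string, in Python?</p>"))
# prints: How split string Python
```

The real function may tokenize differently, so its output will not match this sketch character for character.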

In [17]:
import sys
sys.path.append('../..')
from smartFAQ.src.cleanTxt import brief_cleaning

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [23]:
X_train = brief_cleaning(X_train, str_cols)
X_val = brief_cleaning(X_val, str_cols)
x_test = brief_cleaning(X_test, str_cols)  # note the lowercase name: x_test is the variable saved below

Time to clean up everything: 0.83 mins
Time to clean up everything: 0.38 mins
Time to clean up everything: 0.21 mins


In [24]:
X_train.head()

Unnamed: 0,id,question,question_body,answer
11331,431460,Restarting Python Interpreter Quietly,I python interpreter embedded inside applicati...,One hacky bug prone approach might c module si...
43118,1553478,Most efficient way add new keys append old key...,Here 's common situation compiling data dictio...,Use collections defaultdict default value new ...
13881,514054,How split string using Python,So string name id I need id I used str split '...,Yes delimiter whole string argument passed spl...
23536,855511,referenced assignment error python,In Python I 'm getting following error Unbound...,I think using 'global ' incorrectly See Python...
33095,1187029,How I control number formatting python interpr...,I often use python interpreter quick numerical...,Hm It 's solution come mind How defining subcl...


## Save cleaned data
We save this data so that the same dataset can be reused to compare the different models tested.

In [25]:
with open('../data/x_train.pickle', 'wb') as handle:
    pickle.dump(X_train, handle, protocol=pickle.HIGHEST_PROTOCOL)
with open('../data/y_train.pickle', 'wb') as handle:
    pickle.dump(y_train, handle, protocol=pickle.HIGHEST_PROTOCOL)

In [26]:
with open('../data/x_val.pickle', 'wb') as handle:
    pickle.dump(X_val, handle, protocol=pickle.HIGHEST_PROTOCOL)
with open('../data/y_val.pickle', 'wb') as handle:
    pickle.dump(y_val, handle, protocol=pickle.HIGHEST_PROTOCOL)

In [27]:
with open('../data/x_test.pickle', 'wb') as handle:
    pickle.dump(x_test, handle, protocol=pickle.HIGHEST_PROTOCOL)
with open('../data/y_test.pickle', 'wb') as handle:
    pickle.dump(y_test, handle, protocol=pickle.HIGHEST_PROTOCOL)
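
To reuse the saved splits in another notebook, the files can be read back with `pickle.load`. The sketch below shows the round trip on a small stand-in object written to a temporary directory; the dictionary is made up for illustration, whereas the real files hold pandas objects and live under `../data/`.

```python
import os
import pickle
import tempfile

# Stand-in for a cleaned split (assumption); the real objects are pandas DataFrames/Series.
sample_split = {
    "id": [431460, 514054],
    "question": ["Restarting Python Interpreter Quietly", "How split string using Python"],
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "x_train.pickle")
    # Write with the same call pattern as the cells above.
    with open(path, "wb") as handle:
        pickle.dump(sample_split, handle, protocol=pickle.HIGHEST_PROTOCOL)
    # Read it back, as a modelling notebook would.
    with open(path, "rb") as handle:
        restored = pickle.load(handle)

print(restored == sample_split)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.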

In [None]:
# © Laëtitia CONSTANTIN 2021
