# PREPROCESSING 
---
## IMPORT 

In [1]:
import os
import pandas as pd 
import numpy as np 
import pickle
from sklearn.model_selection import train_test_split

## Get Data 

In [2]:
df_answers = pd.read_csv('../data/Answers.csv', encoding='latin-1')
df_questions = pd.read_csv('../data/Questions.csv', encoding='latin-1')

In [3]:
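# Attach each answer to its question: an answer's ParentId is the question's Id
df_train = df_answers.merge(df_questions, how='inner', left_on='ParentId', right_on='Id')
# ParentId duplicates the question Id (Id_y) after the join, so drop it
df_train = df_train.drop(['ParentId'], axis=1)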

In [4]:
df_train.head()

Unnamed: 0,Id_x,OwnerUserId_x,CreationDate_x,Score_x,Body_x,Id_y,OwnerUserId_y,CreationDate_y,Score_y,Title,Body_y
0,497,50.0,2008-08-02T16:56:53Z,4,<p>open up a terminal (Applications-&gt;Utilit...,469,147.0,2008-08-02T15:11:16Z,21,How can I find the full path to a font from it...,<p>I am using the Photoshop's javascript API t...
1,518,153.0,2008-08-02T17:42:28Z,2,<p>I haven't been able to find anything that d...,469,147.0,2008-08-02T15:11:16Z,21,How can I find the full path to a font from it...,<p>I am using the Photoshop's javascript API t...
2,3040,457.0,2008-08-06T03:01:23Z,12,<p>Unfortunately the only API that isn't depre...,469,147.0,2008-08-02T15:11:16Z,21,How can I find the full path to a font from it...,<p>I am using the Photoshop's javascript API t...
3,195170,745.0,2008-10-12T07:02:40Z,1,<p>There must be a method in Cocoa to get a li...,469,147.0,2008-08-02T15:11:16Z,21,How can I find the full path to a font from it...,<p>I am using the Photoshop's javascript API t...
4,536,161.0,2008-08-02T18:49:07Z,9,<p>You can use ImageMagick's convert utility f...,502,147.0,2008-08-02T17:01:58Z,27,Get a preview JPEG of a PDF on Windows?,<p>I have a cross-platform (Python) applicatio...
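The `_x` and `_y` suffixes in the output above come from the merge: columns that exist in both frames get `_x` for the left frame (answers) and `_y` for the right frame (questions). A minimal sketch on toy data, where the rows and values are invented for illustration and are not taken from the dataset:

```python
import pandas as pd

# Toy stand-ins for Answers.csv and Questions.csv (values are illustrative only)
answers = pd.DataFrame({
    "Id": [497, 518, 536],
    "ParentId": [469, 469, 502],
    "Score": [4, 2, 9],
    "Body": ["answer a", "answer b", "answer c"],
})
questions = pd.DataFrame({
    "Id": [469, 502],
    "Score": [21, 27],
    "Title": ["question one", "question two"],
    "Body": ["body one", "body two"],
})

# Same join as in the notebook: one output row per answer, with its question attached
merged = answers.merge(questions, how="inner", left_on="ParentId", right_on="Id")

# Overlapping column names are suffixed: *_x = answers, *_y = questions
print(list(merged.columns))
```

Because the join is `inner`, an answer whose `ParentId` has no matching question would be dropped silently; comparing `len(merged)` with `len(answers)` is a cheap way to check for that.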


In [5]:
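# Keep the answer id, answer body, answer score, question title and question body
df_train = df_train[['Id_x', 'Body_x', 'Score_x', 'Title', 'Body_y']]
# The answer score (Score_x) becomes the target
df_train.columns = ['id', 'answer', 'target', 'question', 'question_body']
df_train = df_train[['id', 'question', 'question_body', 'answer', 'target']]
# Text columns that will be passed to the cleaning step
str_cols = ['answer', 'question', 'question_body']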

## Split data into train/validation sets

In [6]:
X_train, X_val, y_train, y_val = train_test_split(
    df_train[['id', 'question', 'question_body', 'answer']],
    df_train['target'],
    test_size=0.2,
    random_state=0,
)

In [7]:
# Subsample: keep at most the first 10,000 rows of each split
X_train = X_train.iloc[:10000]
X_val = X_val.iloc[:10000]
y_train = y_train.iloc[:10000]
y_val = y_val.iloc[:10000]

In [8]:
X_train.shape

(10000, 4)

In [9]:
X_train.head()

Unnamed: 0,id,question,question_body,answer
490144,21032057,Deleting multiple indexes from a list at once ...,<p>My problem is I have a list\neg. </p>\n\n...,<p>You can use a list comprehension to rebuild...
815071,34022443,Django Grappelli_Nested Inlines cannot create ...,<p>I am looking for a way to create a new nest...,<p>What i did was to take the templates that d...
62146,2185875,Expanding elements in a list,"<p>I'm looking for a ""nice"" way to process a l...",<p>The last one is probably your most pythonic...
304024,12626260,UnicodeDecodeError while processing filenames,<p>I'm using Python 2.7.3 on Ubuntu 12 x64.</p...,<p>Your filenames are byte strings that contai...
187856,7288407,Python string manipulation -- performance prob...,<p>I have the following piece of code that I e...,<p>EDIT: I'm changing this answer a bit. I'll...


## Launch preprocessing
`brief_cleaning` removes HTML markup, characters other than letters, and stop words from the given text columns.
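The source of `brief_cleaning` is not shown in this notebook, so the block below is only a rough sketch of the three steps described above, not the actual `smartFAQ` implementation. The names `clean_text`, `clean_columns` and `STOP_WORDS` are hypothetical, and the tiny stop-word set is an assumption standing in for a full list such as NLTK's.

```python
import html
import re

import pandas as pd

# Hypothetical, deliberately small stop-word set (assumption for illustration)
STOP_WORDS = {"a", "an", "the", "is", "am", "to", "of", "and", "in", "can", "for"}


def clean_text(text: str) -> str:
    """Strip HTML tags, keep letters only, and drop stop words."""
    text = re.sub(r"<[^>]+>", " ", text)      # remove HTML tags such as <p>
    text = html.unescape(text)                # turn entities like &gt; into characters
    text = re.sub(r"[^A-Za-z]+", " ", text)   # replace anything that is not a letter
    tokens = [tok for tok in text.split() if tok.lower() not in STOP_WORDS]
    return " ".join(tokens)


def clean_columns(df: pd.DataFrame, cols: list) -> pd.DataFrame:
    """Return a copy of df with clean_text applied to each column in cols."""
    out = df.copy()
    for col in cols:
        out[col] = out[col].astype(str).map(clean_text)
    return out


print(clean_text("<p>You can use a list comprehension to rebuild 2 lists.</p>"))
# → You use list comprehension rebuild lists
```

This sketch will not reproduce the notebook's output exactly. For example, the cleaned text shown later contains tokens such as `'m`, which suggests the real function tokenizes with NLTK rather than splitting on whitespace.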

In [10]:
import sys
sys.path.append('../..')
from smartFAQ.src.cleanTxt import brief_cleaning

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [11]:
X_train = brief_cleaning(X_train, str_cols)
X_val = brief_cleaning(X_val, str_cols)

Time to clean up everything: 0.32 mins
Time to clean up everything: 0.33 mins


In [12]:
X_train.head()

Unnamed: 0,id,question,question_body,answer
490144,21032057,Deleting multiple indexes list python,My problem I list eg lst lst pop lst pop Becau...,You use list comprehension rebuild list indice...
815071,34022443,Django Grappelli Nested Inlines create new nes...,I looking way create new nested row saving own...,What take templates django grappelli inline pr...
62146,2185875,Expanding elements list,I 'm looking nice way process list elements ne...,The last one probably pythonic could try impli...
304024,12626260,UnicodeDecodeError processing filenames,I 'm using Python Ubuntu x I files folder file...,Your filenames byte strings contain UTF bytes ...
187856,7288407,Python string manipulation performance problems,I following piece code I execute around millio...,EDIT I 'm changing answer bit I 'll leave orig...


## Save cleaned data

In [13]:
with open('../data/x_train.pickle', 'wb') as handle:
    pickle.dump(X_train, handle, protocol=pickle.HIGHEST_PROTOCOL)
with open('../data/y_train.pickle', 'wb') as handle:
    pickle.dump(y_train, handle, protocol=pickle.HIGHEST_PROTOCOL)

In [14]:
with open('../data/x_val.pickle', 'wb') as handle:
    pickle.dump(X_val, handle, protocol=pickle.HIGHEST_PROTOCOL)
with open('../data/y_val.pickle', 'wb') as handle:
    pickle.dump(y_val, handle, protocol=pickle.HIGHEST_PROTOCOL)
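A later notebook will presumably need to read these files back. The sketch below shows a save/load round trip that mirrors the `pickle.dump` calls above; the helper names `save_pickle` and `load_pickle` are hypothetical, and a temporary directory with a toy frame is used so the example runs without the real data.

```python
import pickle
import tempfile
from pathlib import Path

import pandas as pd


def save_pickle(obj, path):
    """Write obj to path using the same protocol as the notebook."""
    with open(path, "wb") as handle:
        pickle.dump(obj, handle, protocol=pickle.HIGHEST_PROTOCOL)


def load_pickle(path):
    """Read back an object written by save_pickle."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


# Toy frame standing in for X_train (values are illustrative only)
toy = pd.DataFrame({"id": [1, 2], "answer": ["first", "second"]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "x_toy.pickle"
    save_pickle(toy, path)
    restored = load_pickle(path)

# True when the restored frame matches the original
print(restored.equals(toy))
```

Unpickling can execute arbitrary code, so only load pickle files that come from a trusted source.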
