# Data mining:

In this part we clean the documents, then use K-fold cross-validation to find the best fold and to choose the best model among SVM, KNN and Naive Bayes (Gaussian). SVM turns out to be the best model, so we then run GridSearchCV on SVM to find the best parameters for it.

## Importing tools:
First of all, we import the libraries that we will use in our project, as in the cell below:

In [509]:
import nltk
import numpy as np
import pandas as pd
from sklearn.svm import LinearSVC,SVC
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split,cross_val_score,GridSearchCV,KFold
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
import collections
import xml.etree.ElementTree as ET
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize
from os import listdir
from os.path import isfile, join
import shutil
from joblib import dump, load

### Download the required datasets:
To use NLTK (Natural Language Toolkit) we need to download the following datasets, which are used for preprocessing (tokenization, stop words, lemmatization...).

In [468]:
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ami\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\ami\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ami\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

## Arranging the data folder:
We start by reading the CSV files to find the files that do not have a class yet and move them into a new folder, so that the data folder matches the train.csv file. As a result we get **4800 files in the data folder and 200 in the test_set folder**. This step is only executed the first time.
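
The function that performs this step is not shown in this notebook. A minimal sketch of what it could look like, assuming the labeled file names come from the file column of train.csv and the folders are named data and test_set; the helper name move_unlabeled is hypothetical, and the demo runs in a temporary directory so no real data is touched:

```python
import os
import shutil
import tempfile

def move_unlabeled(data_dir, labeled_names, dest_dir):
    """Move every file in data_dir that is not in labeled_names into dest_dir.

    Hypothetical helper: returns the sorted list of moved file names.
    """
    labeled = set(labeled_names)
    os.makedirs(dest_dir, exist_ok=True)
    moved = []
    for name in sorted(os.listdir(data_dir)):
        path = os.path.join(data_dir, name)
        if os.path.isfile(path) and name not in labeled:
            shutil.move(path, os.path.join(dest_dir, name))
            moved.append(name)
    return moved

# Demo on a throwaway directory with three small files, two of them "labeled".
with tempfile.TemporaryDirectory() as tmp:
    data_dir = os.path.join(tmp, "data")
    test_dir = os.path.join(tmp, "test_set")
    os.makedirs(data_dir)
    for name in ["a.xml", "b.xml", "c.xml"]:
        with open(os.path.join(data_dir, name), "w") as fh:
            fh.write("<doc>text</doc>")
    moved = move_unlabeled(data_dir, ["a.xml", "b.xml"], test_dir)
    remaining = sorted(os.listdir(data_dir))

print(moved, remaining)
```

In the notebook this would be called once, for example as move_unlabeled("data", df["file"], "test_set") with df read from train.csv.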

## Cleaning process:
In this process we first parse each XML file, then tokenize its text so that it is represented as a Python list. We then lemmatize each word and stem it to get the word stem. The variable **cleaned_docs** will contain the list of cleaned documents.

In [469]:
################# docs that we are going to train on####################
df = pd.read_csv('train.csv',delimiter=",")
files=df['file']
Y=df['earnings: 0 no/ 1 yes']
cleaned_docs=[]
# the stemmer and lemmatizer are stateless, so one instance of each is enough
stemmer = PorterStemmer()
lemmatiser = WordNetLemmatizer()
for f in files:
    # parse the XML file and close the handle as soon as parsing is done
    with open("data/"+f,"r") as s:
        root = ET.parse(s).getroot()
    doc_word=str(root.text)
    token = word_tokenize(doc_word)
    # lemmatize, then stem, every token
    clean_tokens = [stemmer.stem(lemmatiser.lemmatize(word)) for word in token]
    cleaned_docs.append(' '.join(clean_tokens))

In [470]:
#################### docs to predict their classes######################
df = pd.read_csv('test.csv',delimiter=",")
files=df['file']
docs=[]
# same cleaning steps as for the training documents
stemmer = PorterStemmer()
lemmatiser = WordNetLemmatizer()
for f in files:
    # parse the XML file and close the handle as soon as parsing is done
    with open("data/"+f,"r") as s:
        root = ET.parse(s).getroot()
    doc_word=str(root.text)
    token = word_tokenize(doc_word)
    # lemmatize, then stem, every token
    clean_tokens = [stemmer.stem(lemmatiser.lemmatize(word)) for word in token]
    docs.append(' '.join(clean_tokens))

## Features extraction (TF-IDF):
To extract features from our text files we use the scikit-learn TfidfVectorizer, which tokenizes the documents and computes the TF-IDF of each term. At the end, the variables **X** and **docs** each contain a **(docs-features)** matrix of TF-IDF values: **X** is our dataset with its known classes, and **docs** is the set of documents whose classes we will predict.

**TfidfVectorizer params:**
* **min_df=10 :** remove terms that appear in fewer than 10 documents.
* **max_df=0.9 :** remove terms that appear in more than 90% of the documents.
* **stop_words:** takes a list of stop words to remove from the documents.
* **ngram_range:** takes the range of n-gram sizes (number of words in a sequence); here (1,2), i.e. single words and pairs of words.
* **lowercase:** True to turn all terms to lowercase.
* **token_pattern:** takes a regular expression that defines what counts as a token; here only purely alphabetic tokens are kept.
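
To make the min_df and max_df thresholds concrete, here is a small plain-Python sketch of document-frequency filtering on a toy corpus. It only mirrors the thresholding rule described in the list (an integer min_df is an absolute document count, a float max_df is a proportion of the documents); it is not scikit-learn's implementation, and the name filter_by_df is hypothetical:

```python
from collections import Counter

def filter_by_df(tokenized_docs, min_df, max_df):
    """Keep terms whose document frequency lies in [min_df, max_df * n_docs]."""
    n_docs = len(tokenized_docs)
    # count in how many documents each term appears (not how many times)
    df = Counter(term for doc in tokenized_docs for term in set(doc))
    max_count = max_df * n_docs
    return sorted(t for t, c in df.items() if min_df <= c <= max_count)

toy_docs = [
    ["profit", "rose", "the"],
    ["profit", "fell", "the"],
    ["oil", "price", "the"],
    ["oil", "profit", "the"],
]

# "the" appears in all 4 documents (more than 90%), so it is dropped;
# "rose", "fell" and "price" appear in only 1 document, so they are dropped too
kept = filter_by_df(toy_docs, min_df=2, max_df=0.9)
print(kept)
```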


In [471]:
tfidf=TfidfVectorizer(stop_words=stopwords.words('english'),min_df=10,max_df=0.9,
                      ngram_range=(1,2),lowercase=True,token_pattern=r'(?u)\b[A-Za-z]+\b')
tfidf.fit(cleaned_docs)
X=tfidf.transform(cleaned_docs).toarray()
docs=tfidf.transform(docs).toarray()

In [472]:
X.shape # 4800 documents, 5398 features

(4800, 5398)

## Splitting training/test sets:
We split our dataset into a training set and a test set. We use the default parameters, so the training set gets 75% of the documents (3600 docs) and the test set gets 25% (1200 docs).

In [448]:
X_train, X_test, y_train, y_test = train_test_split(X, Y) # split test set=> 25% and train set 75%
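
Because no random_state is passed, the split above changes on every run. A hedged sketch of a reproducible, stratified variant on stand-in data (the arrays and the seed 0 are assumptions, not the notebook's data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# stand-in data: 4800 rows, one positive label in four
X_demo = np.zeros((4800, 3))
y_demo = (np.arange(4800) % 4 == 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo,
    test_size=0.25,    # same proportion as the default
    random_state=0,    # fixed seed so the split is repeatable
    stratify=y_demo,   # keep the class ratio the same in both parts
)
print(X_tr.shape, X_te.shape, y_tr.mean(), y_te.mean())
```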

## Choosing the right model:
We have selected three models and use K-fold cross-validation to determine which one performs best.

In [486]:
names=["svm","knn","naiveBayes"]
clfs=[SVC(C=1.0,kernel='linear'),KNeighborsClassifier(n_neighbors=5),GaussianNB()]

In [487]:
kf = KFold(n_splits=4)
kf.get_n_splits(X)
print(kf)
for train_index, test_index in kf.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = Y[train_index], Y[test_index]
    for name,clf in zip(names,clfs):
        print("Model-name:"+name+" Score:"+str(clf.fit(X_train,y_train).score(X_test,y_test)))
        scores = cross_val_score(clf, X, Y, cv=4)  # Y holds the labels read from train.csv
        print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

KFold(n_splits=4, random_state=None, shuffle=False)
TRAIN: [1200 1201 1202 ... 4797 4798 4799] TEST: [   0    1    2 ... 1197 1198 1199]
Model-name:svm Score:0.9466666666666667
Accuracy: 0.96 (+/- 0.02)
Model-name:knn Score:0.9375
Accuracy: 0.96 (+/- 0.02)
Model-name:naiveBayes Score:0.8666666666666667
Accuracy: 0.87 (+/- 0.01)
TRAIN: [   0    1    2 ... 4797 4798 4799] TEST: [1200 1201 1202 ... 2397 2398 2399]
Model-name:svm Score:0.9691666666666666
Accuracy: 0.96 (+/- 0.02)
Model-name:knn Score:0.9591666666666666
Accuracy: 0.96 (+/- 0.02)
Model-name:naiveBayes Score:0.8683333333333333
Accuracy: 0.87 (+/- 0.01)
TRAIN: [   0    1    2 ... 4797 4798 4799] TEST: [2400 2401 2402 ... 3597 3598 3599]
Model-name:svm Score:0.9716666666666667
Accuracy: 0.96 (+/- 0.02)
Model-name:knn Score:0.9683333333333334
Accuracy: 0.96 (+/- 0.02)
Model-name:naiveBayes Score:0.8783333333333333
Accuracy: 0.87 (+/- 0.01)
TRAIN: [   0    1    2 ... 3597 3598 3599] TEST: [3600 3601 3602 ... 4797 4798 4799]
Model
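
The loop above refits every model by hand on each fold and also calls cross_val_score inside the loop, so the same cross-validated accuracy is printed once per fold. A more compact sketch calls cross_val_score once per model. Synthetic data from make_classification stands in for the TF-IDF matrix, and the folds are shuffled with a fixed seed, so the scores are not the notebook's:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# synthetic stand-in for the (documents x features) matrix and its labels
X_demo, y_demo = make_classification(n_samples=400, n_features=20, random_state=0)

models = {
    "svm": SVC(C=1.0, kernel="linear"),
    "knn": KNeighborsClassifier(n_neighbors=5),
    "naiveBayes": GaussianNB(),
}
cv = KFold(n_splits=4, shuffle=True, random_state=0)

mean_scores = {}
for name, model in models.items():
    # one accuracy value per fold, computed on the held-out part of that fold
    scores = cross_val_score(model, X_demo, y_demo, cv=cv)
    mean_scores[name] = scores.mean()
    print("%s: %0.2f (+/- %0.2f)" % (name, scores.mean(), scores.std() * 2))
```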

## GridSearchCV:
From these results we see that SVM beats the other classifiers in both the per-fold score and the cross-validated accuracy. We keep the split given by the 3rd fold of the dataset, on which the classifier reaches a higher score, and tune SVM on it.

In [489]:
parameters = {'C':[1, 10]}
clf=SVC(kernel="linear")
clf = GridSearchCV(clf, parameters, cv=5)

In [490]:
clf.fit(X_train,y_train)

GridSearchCV(cv=5, error_score='raise',
       estimator=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False),
       fit_params=None, iid=True, n_jobs=1, param_grid={'C': [1, 10]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)
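
Once fitted, the search object exposes the selected value of C and its mean cross-validated score through best_params_ and best_score_. A small sketch on synthetic data (not the notebook's dataset):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# synthetic stand-in for the training split
X_demo, y_demo = make_classification(n_samples=300, n_features=20, random_state=0)

search = GridSearchCV(SVC(kernel="linear"), {"C": [1, 10]}, cv=5)
search.fit(X_demo, y_demo)

best_C = search.best_params_["C"]
print("best C:", best_C)
print("mean CV accuracy: %0.3f" % search.best_score_)
# refit=True by default, so search.predict uses an SVC retrained with best_C
```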

## Evaluation of the model:
We use the methods below to evaluate the tuned model and report its precision, recall and F1-measure.

In [493]:
print("score: "+str(clf.score(X_test,y_test)))

score: 0.965


In [496]:
p=clf.predict(X_test)
print(confusion_matrix(y_test,p))  
print(classification_report(y_test,p))  
print(accuracy_score(y_test, p))  

[[928  12]
 [ 30 230]]
             precision    recall  f1-score   support

          0       0.97      0.99      0.98       940
          1       0.95      0.88      0.92       260

avg / total       0.96      0.96      0.96      1200

0.965
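
The figures in the report can be recomputed by hand from the confusion matrix, whose rows are the true classes and whose columns are the predicted classes. The following sketch does the arithmetic for class 1 and for the overall accuracy:

```python
# confusion matrix printed above: rows = true class, columns = predicted class
cm = [[928, 12],
      [30, 230]]

tn, fp = cm[0]   # true class 0: predicted 0, predicted 1
fn, tp = cm[1]   # true class 1: predicted 0, predicted 1

precision = tp / (tp + fp)   # of the docs predicted as 1, the share that really are 1
recall = tp / (tp + fn)      # of the docs that really are 1, the share that was found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(round(precision, 2), round(recall, 2), round(f1, 2), accuracy)
```

Rounded to two decimals, these values match the row for class 1 in the report and the final accuracy line.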


## Prediction of the documents:
Now we predict the labels of the given documents. The result is an array like the following, with one label per document:

In [497]:
clf.predict(docs)

array([0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1,
       0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1,
       1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1,
       1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1,
       0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0,
       1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0,
       0, 0], dtype=int64)

In [502]:
prediction=clf.predict(docs)
collections.Counter(prediction)

Counter({0: 150, 1: 50})

## Saving our prediction in test.csv:
After predicting the test documents, we save the predictions in the third column of the test.csv file.

In [517]:
df = pd.read_csv('test.csv',delimiter=",")

In [518]:
df['earnings: 0 no/ 1 yes']=prediction

In [519]:
df.to_csv("test.csv", sep=',', encoding='utf-8', index=False)  # index=False avoids writing an extra index column

## Saving the model:
To use the classifier directly, without losing time retraining the model, we save it in a file. When we need it, we just load it, as we can see below.

In [520]:
dump(clf, 'SVM_classifier.joblib') 


['SVM_classifier.joblib']

In [521]:
clf=load('SVM_classifier.joblib') 


In [522]:
clf.predict(docs)

array([0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1,
       0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1,
       1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1,
       1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1,
       0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0,
       1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0,
       0, 0], dtype=int64)
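
The saved file contains only the classifier. To classify new raw text in a later session, the fitted TfidfVectorizer is needed as well, since it holds the vocabulary and the IDF weights. One possible approach, sketched here on toy data with a hypothetical file name, is to dump both objects together:

```python
import os
import tempfile

from joblib import dump, load
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

# toy corpus standing in for the cleaned documents
texts = ["profit rose sharply", "profit fell", "oil price up", "oil export down"]
labels = [1, 1, 0, 0]

vec = TfidfVectorizer()
model = SVC(kernel="linear").fit(vec.fit_transform(texts), labels)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "svm_with_tfidf.joblib")  # hypothetical file name
    dump({"tfidf": vec, "clf": model}, path)
    saved = load(path)

# the reloaded pair gives the same predictions as the in-memory objects
new_texts = ["profit up", "oil down"]
before = model.predict(vec.transform(new_texts))
after = saved["clf"].predict(saved["tfidf"].transform(new_texts))
print(before, after)
```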

# Data analysis:
In this part we use PCA (Principal Component Analysis) to reduce the dimensionality of the dataset, and then K-means for clustering (unsupervised classification), to see whether there are truly two classes or more than that.

In [528]:
pca = PCA(n_components=2)
pca.fit(X)                 
print(pca.explained_variance_ratio_)  
print(pca.singular_values_)

[0.08269211 0.04384641]
[19.51884537 14.21310989]


In [530]:
from sklearn.preprocessing import StandardScaler
X_std = StandardScaler().fit_transform(X)

In [531]:
pca.fit(X_std)
print(pca.explained_variance_ratio_)  
print(pca.singular_values_)

[0.0061672  0.00507547]
[399.74322802 362.63954402]
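
The K-means step announced at the start of this part is not shown in this notebook. A minimal sketch of how it could follow the PCA projection, using synthetic blobs in place of the TF-IDF matrix and the silhouette score as one possible way to compare cluster counts (both are assumptions, not the author's method):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

# synthetic stand-in: 300 points in 50 dimensions drawn around 2 centers
X_demo, _ = make_blobs(n_samples=300, n_features=50, centers=2, random_state=0)

# project onto the first two principal components
X_2d = PCA(n_components=2).fit_transform(X_demo)

silhouettes = {}
for k in (2, 3, 4):
    km = KMeans(n_clusters=k, n_init=10, random_state=0)
    cluster_ids = km.fit_predict(X_2d)
    # silhouette is in [-1, 1]; higher means tighter, better separated clusters
    silhouettes[k] = silhouette_score(X_2d, cluster_ids)
    print(k, round(silhouettes[k], 3))

best_k = max(silhouettes, key=silhouettes.get)
print("best k:", best_k)
```

On the real data, X (or X_std) would replace X_demo. With only about 8% and 4% of the variance explained by the first two components, a 2-D projection may hide much of the structure, so more components might be worth trying before clustering.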
