# Assignment 5 - Text Analysis
An explanation of this assignment can be found in the .pdf explanation document.


## Materials to review for this assignment
<h4>From Moodle:</h4> 
<h5><u>Review the notebooks regarding the following Python topics</u>:</h5>
<div class="alert alert-info">
&#x2714; <b>Working with strings</b> (tutorial notebook)<br/>
&#x2714; <b>Text Analysis</b> (tutorial notebook)<br/>
&#x2714; <b>Hebrew text analysis tools (tokenizer, WordNet)</b> (Moodle example)<br/>
&#x2714; <b>(brief review) All previous notebooks</b><br/>
</div> 
<h5><u>Review the presentations regarding the following topics</u>:</h5>
<div class="alert alert-info">
&#x2714; <b>Text Analysis</b> (lecture presentation)<br/>
&#x2714; <b>(brief review) All other presentations</b><br/>
</div>

## Preceding Step - import modules (packages)
This step is necessary in order to use external modules (packages). <br/>

In [1]:
# --------------------------------------
import pandas as pd
import numpy as np
# --------------------------------------


# --------------------------------------
# ------------- visualizations:
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
# --------------------------------------


# ---------------------------------------
import sklearn
from sklearn import preprocessing, metrics, pipeline, model_selection, feature_extraction 
from sklearn import naive_bayes, linear_model, svm, neural_network, neighbors, tree
from sklearn import decomposition, cluster

from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV 
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.metrics import mean_squared_error, r2_score, silhouette_score
from sklearn.preprocessing import MinMaxScaler, StandardScaler, LabelEncoder

from sklearn.svm import LinearSVC
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import Perceptron, SGDClassifier
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.naive_bayes import MultinomialNB, GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
# ---------------------------------------


# ----------------- output and visualizations: 
import warnings
from sklearn.exceptions import ConvergenceWarning
warnings.simplefilter("ignore")
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter("ignore", category=ConvergenceWarning)
# show the output of every expression in a cell (not only the last one), so several results fit in one cell.
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
%matplotlib inline
pd.set_option('display.max_columns', None)
pd.set_option('display.float_format', lambda x: '%.3f' % x)
# ---------------------------------------

### Text analysis and String manipulation imports:

In [2]:
# --------------------------------------
# --------- Text analysis and Hebrew text analysis imports:
# vectorizers:
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# regular expressions:
import re
# --------------------------------------

### (optional) Hebrew text analysis - WordNet (for Hebrew)
Note: WordNet is optional.

#### (optional) Only if you didn't install Wordnet (for Hebrew) use:

In [None]:
# word net installation:

# uncomment if you want to use it and need to install it:
# !pip install wn
# !python -m wn download omw-he:1.4

In [None]:
# word net import:

# uncomment if you want to use it:
# import wn

### (optional) Hebrew text analysis - hebrew_tokenizer (Tokenizer for Hebrew)
Note: hebrew_tokenizer is optional.

#### (optional) Only if you didn't install hebrew_tokenizer use:

In [3]:
# Hebrew tokenizer installation:

# comment out the next line if hebrew_tokenizer is already installed:
!pip install hebrew_tokenizer



In [3]:
# Hebrew tokenizer import:

# comment out the next line if you do not use the tokenizer:
import hebrew_tokenizer as ht



### Reading input files
Read the input files for the annotated training corpus (raw text data) and for the test corpus.

In [158]:
train_filename = 'annotated_corpus_for_train.csv'
test_filename  = 'corpus_for_test.csv'
df_train = pd.read_csv(train_filename, index_col=None, encoding='utf-8')
df_test  = pd.read_csv(test_filename, index_col=None, encoding='utf-8')

In [159]:
df_train.head(8)
df_train.shape

Unnamed: 0,story,gender
0,"כשחבר הזמין אותי לחול, לא באמת חשבתי שזה יקרה,...",m
1,לפני שהתגייסתי לצבא עשיתי כל מני מיונים ליחידו...,m
2,מאז שהתחילו הלימודים חלומו של כל סטודנט זה הפנ...,f
3,"כשהייתי ילד, מטוסים היה הדבר שהכי ריתק אותי. ב...",m
4,‏הייתי מדריכה בכפר נוער ומתאם הכפר היינו צריכי...,f
5,לפני כ3 חודשים טסתי לרומא למשך שבוע. טסתי במטו...,f
6,אני כבר שנתיים נשוי והשנה אני ואישתי סוף סוף י...,m
7,השנה התחלנו שיפוץ בדירה שלנו בתל אביב. הדירה ה...,f


(753, 2)

In [160]:
df_test.head(3)
df_test.shape

Unnamed: 0,test_example_id,story
0,0,כל קיץ אני והמשפחה נוסעים לארצות הברית לוס אנג...
1,1,"הגעתי לשירות המדינה אחרי שנתיים כפעיל בתנועת ""..."
2,2,אחת האהבות הגדולות שלי אלו הכלבים שלי ושל אישת...


(323, 2)

### Your implementation:
Write your code solution in the following code-cells

In [124]:
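# Clean the text: remove special characters, digits, and English letters, then tidy the whitespace.
def cleaning_txt(s):
    s = re.sub(r'[^\w\s]', '', s)    # drop punctuation and other special characters
    s = re.sub(r'\d+', '', s)        # drop digits
    s = re.sub(r'[A-Za-z]+', '', s)  # drop English (Latin) letters
    s = re.sub(r'\s+', ' ', s)       # collapse whitespace runs left behind by the removals
    s = s.strip()                    # drop leading and trailing spaces
    return s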

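# Transform the stories into feature vectors.
# kind: 'c' for raw term counts, 't' for TF-IDF weights; m: min_df (minimum document frequency).
def make_Vector(texts,kind,m):
    if kind == 'c':
        vec = CountVectorizer(min_df=m)
    elif kind == 't':
        vec = TfidfVectorizer(min_df=m)
    else:
        raise ValueError("kind must be 'c' (counts) or 't' (TF-IDF)")
    matrix = vec.fit_transform(texts)
    return pd.DataFrame(matrix.toarray(), columns=vec.get_feature_names_out())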


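# Build the candidate classifiers with their default settings:
def make_Models():
    models = dict()
    models["KNN"] = KNeighborsClassifier()  
    models["Decsion_Tree"] = tree.DecisionTreeClassifier()
    models["Perceptron"] = Perceptron()
    models["Naive_Bayes"] = GaussianNB()
    # SGDClassifier is a linear model trained with SGD; with its default hinge loss it acts as a linear SVM
    models["SVM"] = SGDClassifier()
    return models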

# Grid-search the hyperparameters of each model, then report its F1 scores on the held-out split.
# Note: this relies on the global X_train, X_test, y_train, y_test defined in the calling cell.
def Train_and_Fit(models_str,models,para):
    for m in models_str:
        clf = GridSearchCV(models[m], para[m])  
        clf.fit(X_train,y_train)
        print("Model: ",m)
        print("Best parameters: ", clf.best_params_)
        y_pred = clf.best_estimator_.predict(X_test)
        male_score = f1_score(y_test, y_pred, pos_label='m')
        female_score = f1_score(y_test, y_pred, pos_label='f')
        print("Male f1 score: " ,male_score)
        print("Female f1 score: ",female_score)
        print("Average f1 score: ",(male_score+female_score)/2)
        print("\n")
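
The cleaning helper above is easiest to follow on a single sentence. The sketch below applies the same kind of regular-expression substitutions to a made-up string; the name `clean_text` and the sample sentence are assumptions for illustration only, and collapsing whitespace last is one reasonable ordering rather than the only possible one.

```python
import re

def clean_text(s):
    """Illustrative cleaner: keep only non-English letters separated by single spaces."""
    s = re.sub(r'[^\w\s]', '', s)    # punctuation and special characters
    s = re.sub(r'\d+', '', s)        # digits
    s = re.sub(r'[A-Za-z]+', '', s)  # English (Latin) letters
    s = re.sub(r'\s+', ' ', s)       # collapse whitespace runs
    return s.strip()

# Made-up sample mixing Hebrew, English, digits, and punctuation.
sample = "טסתי ל-Rome בשנת 2019!  היה מעולה."
cleaned = clean_text(sample)
print(cleaned)
```

Because the vectorizers split on word boundaries, leftover double spaces would not change the features, but a tidy string is easier to inspect by eye.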

In [125]:
# YOUR CODE HERE
X = df_train["story"]
y = df_train["gender"]

# cleaning the stories:
cln_X = X.apply(cleaning_txt)

# transforming the stories to feature vectors using the 2 methods
# (min_df=1 keeps every term; newer scikit-learn releases validate min_df and may reject an integer 0):
X_count = make_Vector(cln_X,'c',1)
X_tfidf = make_Vector(cln_X,'t',1)

# scaling the feature vectors to the [0, 1] range:
scaler = MinMaxScaler()
X_count = scaler.fit_transform(X_count)
X_tfidf = scaler.fit_transform(X_tfidf)

# splitting the data for training and testing:
X_train,X_test,y_train,y_test = train_test_split(X_tfidf, y, test_size=0.2, random_state=42)

# hyperparameter grids that GridSearchCV searches for each model:
para = {'KNN': {'n_neighbors': [3,5,7], 'metric': ['euclidean', 'manhattan', 'minkowski']},
        'Decsion_Tree': {'max_depth':[3,5,7] , 'min_samples_split': range(5,30,5)},
        'Perceptron': {'alpha': [0.0001, 0.05],'penalty':['l2','l1','elasticnet']},
        'SVM': {'loss': ['hinge','log_loss','modified_huber'],'max_iter': [5],'alpha': [0.0001, 0.05],'penalty':['l2','l1','elasticnet']},
        'Naive_Bayes': {}}

# results using the TF-IDF feature vectors:
print("Using tfidf vector:")
models_str=['Decsion_Tree','KNN', 'Naive_Bayes', 'Perceptron','SVM']
models = make_Models()
Train_and_Fit(models_str,models,para)

Using tfidf vector:
Model:  Decsion_Tree
Best parameters:  {'max_depth': 7, 'min_samples_split': 5}
Male f1 score:  0.8395061728395061
Female f1 score:  0.33898305084745767
Average f1 score:  0.5892446118434819


Model:  KNN
Best parameters:  {'metric': 'euclidean', 'n_neighbors': 3}
Male f1 score:  0.8603773584905661
Female f1 score:  0.0
Average f1 score:  0.43018867924528303


Model:  Naive_Bayes
Best parameters:  {}
Male f1 score:  0.8636363636363636
Female f1 score:  0.052631578947368425
Average f1 score:  0.45813397129186606


Model:  Perceptron
Best parameters:  {'alpha': 0.0001, 'penalty': 'l1'}
Male f1 score:  0.8774703557312253
Female f1 score:  0.3673469387755103
Average f1 score:  0.6224086472533679


Model:  SVM
Best parameters:  {'alpha': 0.0001, 'loss': 'modified_huber', 'max_iter': 5, 'penalty': 'l2'}
Male f1 score:  0.8806584362139916
Female f1 score:  0.5084745762711864
Average f1 score:  0.694566506242589




In [126]:
# results using count feature vectors:
X_train,X_test,y_train,y_test = train_test_split(X_count, y, test_size=0.2, random_state=42)
para = {'KNN': {'n_neighbors': [3,5,7], 'metric': ['euclidean', 'manhattan', 'minkowski']},
        'Decsion_Tree': {'max_depth':[3,5,7] , 'min_samples_split': range(5,30,5)},
        'Perceptron': {'alpha': [0.0001, 0.05],'penalty':['l2','l1','elasticnet']},
        'SVM': {'loss': ['hinge','log_loss','modified_huber'],'max_iter': [5],'alpha': [0.0001, 0.05],'penalty':['l2','l1','elasticnet']},
        'Naive_Bayes': {}}
print("Using count vector:")
models2_str=['Decsion_Tree','KNN', 'Naive_Bayes', 'Perceptron','SVM']
models2 = make_Models()
Train_and_Fit(models2_str,models2,para)

Using count vector:
Model:  Decsion_Tree
Best parameters:  {'max_depth': 3, 'min_samples_split': 20}
Male f1 score:  0.8671875
Female f1 score:  0.26086956521739135
Average f1 score:  0.5640285326086957


Model:  KNN
Best parameters:  {'metric': 'manhattan', 'n_neighbors': 3}
Male f1 score:  0.8603773584905661
Female f1 score:  0.0
Average f1 score:  0.43018867924528303


Model:  Naive_Bayes
Best parameters:  {}
Male f1 score:  0.8636363636363636
Female f1 score:  0.052631578947368425
Average f1 score:  0.45813397129186606


Model:  Perceptron
Best parameters:  {'alpha': 0.0001, 'penalty': 'l1'}
Male f1 score:  0.8709677419354839
Female f1 score:  0.40740740740740744
Average f1 score:  0.6391875746714457


Model:  SVM
Best parameters:  {'alpha': 0.0001, 'loss': 'modified_huber', 'max_iter': 5, 'penalty': 'l2'}
Male f1 score:  0.874015748031496
Female f1 score:  0.33333333333333337
Average f1 score:  0.6036745406824147
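
In the cells above, the stories are vectorized and scaled once on the full training corpus before the split, and `GridSearchCV` is left on its default scoring (accuracy for classifiers) even though the reported metric is F1. One possible alternative, sketched here under assumptions, is to put the vectorizer and the classifier in a single `Pipeline` so that each cross-validation fold fits its own vocabulary, and to search on `f1_macro`. The toy corpus, labels, and grid values are hypothetical and far smaller than the assignment data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import Perceptron
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus with two balanced classes.
texts = ["red apple sweet", "green apple sour", "red cherry sweet", "sweet red berry",
         "green lime sour", "sour green grape", "red plum sweet", "green pear sour"]
labels = ['a', 'b', 'a', 'a', 'b', 'b', 'a', 'b']

# The vectorizer is a pipeline step, so it is refit inside every CV fold.
pipe = Pipeline([
    ("vec", CountVectorizer()),
    ("clf", Perceptron(random_state=0)),
])

param_grid = {
    "vec__min_df": [1, 2],
    "clf__penalty": [None, "l2", "l1"],
}

# scoring='f1_macro' makes the search optimize the same metric that is reported.
grid = GridSearchCV(pipe, param_grid, scoring="f1_macro", cv=2)
grid.fit(texts, labels)

print(grid.best_params_)
prediction = grid.predict(["sweet red fruit"])
print(prediction)
```

With this layout the raw strings go straight into `fit` and `predict`, so the test stories can never influence the vocabulary by accident.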




In [136]:
# raising min_df to 2 drops terms that appear in only one story (fewer features, most typos ignored):
X_tfidf = make_Vector(cln_X,'t',2)
scaler = MinMaxScaler()
X_tfidf = scaler.fit_transform(X_tfidf)
X_train,X_test,y_train,y_test = train_test_split(X_tfidf, y, test_size=0.2, random_state=42)

print("Using tfidf vector with min_df = 2:")
models_str=['Decsion_Tree','KNN', 'Naive_Bayes', 'Perceptron','SVM']
models = make_Models()
Train_and_Fit(models_str,models,para)

Using tfidf vector with min_df = 2:
Model:  Decsion_Tree
Best parameters:  {'max_depth': 3, 'min_samples_split': 15}
Male f1 score:  0.8616600790513833
Female f1 score:  0.28571428571428575
Average f1 score:  0.5736871823828346


Model:  KNN
Best parameters:  {'metric': 'euclidean', 'n_neighbors': 5}
Male f1 score:  0.8603773584905661
Female f1 score:  0.0
Average f1 score:  0.43018867924528303


Model:  Naive_Bayes
Best parameters:  {}
Male f1 score:  0.8636363636363636
Female f1 score:  0.052631578947368425
Average f1 score:  0.45813397129186606


Model:  Perceptron
Best parameters:  {'alpha': 0.0001, 'penalty': 'l1'}
Male f1 score:  0.8852459016393444
Female f1 score:  0.5172413793103449
Average f1 score:  0.7012436404748446


Model:  SVM
Best parameters:  {'alpha': 0.0001, 'loss': 'hinge', 'max_iter': 5, 'penalty': 'l1'}
Male f1 score:  0.8632478632478633
Female f1 score:  0.5294117647058824
Average f1 score:  0.6963298139768728




In [137]:
# count vectors restricted to terms that appear in at least 50 stories:
X_count_50 = make_Vector(cln_X,'c',50)
scaler = MinMaxScaler()
X_count_50 = scaler.fit_transform(X_count_50)
X_train,X_test,y_train,y_test = train_test_split(X_count_50, y, test_size=0.2, random_state=42)

print("Using count vector with min_df = 50:")
models_str=['Decsion_Tree','KNN', 'Naive_Bayes', 'Perceptron','SVM']
models = make_Models()
Train_and_Fit(models_str,models,para)

Using count vector with min_df = 50:
Model:  Decsion_Tree
Best parameters:  {'max_depth': 3, 'min_samples_split': 20}
Male f1 score:  0.8593155893536121
Female f1 score:  0.05128205128205129
Average f1 score:  0.4552988203178317


Model:  KNN
Best parameters:  {'metric': 'euclidean', 'n_neighbors': 5}
Male f1 score:  0.8593155893536121
Female f1 score:  0.05128205128205129
Average f1 score:  0.4552988203178317


Model:  Naive_Bayes
Best parameters:  {}
Male f1 score:  0.8251121076233183
Female f1 score:  0.5063291139240507
Average f1 score:  0.6657206107736845


Model:  Perceptron
Best parameters:  {'alpha': 0.0001, 'penalty': 'l1'}
Male f1 score:  0.8634361233480176
Female f1 score:  0.5866666666666667
Average f1 score:  0.7250513950073421


Model:  SVM
Best parameters:  {'alpha': 0.05, 'loss': 'modified_huber', 'max_iter': 5, 'penalty': 'l2'}
Male f1 score:  0.8629032258064516
Female f1 score:  0.37037037037037035
Average f1 score:  0.616636798088411




In [185]:
# using the best model (Perceptron on count vectors with min_df = 50) to predict the test set:
combo = pd.concat([df_train["story"], df_test["story"]] , ignore_index=True)
# clean every story, train and test alike:
combo = combo.apply(cleaning_txt)
combo = make_Vector(combo,'c',50)
n_train = len(df_train)
X_train = combo.iloc[:n_train,:]
X_test = combo.iloc[n_train:,:]
y_train = df_train["gender"]
# fit the scaler on the training rows only and reuse it for the test rows:
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
clf = Perceptron(alpha=0.0001,penalty='l1')
clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)
result = pd.DataFrame({"test_example_id" : df_test["test_example_id"] , "predicted_category": y_pred})
result.head(5)
result.tail(5)


Perceptron(penalty='l1')

Unnamed: 0,test_example_id,predicted_category
0,0,m
1,1,m
2,2,m
3,3,m
4,4,m


Unnamed: 0,test_example_id,predicted_category
318,318,m
319,319,m
320,320,m
321,321,m
322,322,m
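
The cell above builds one vocabulary from the concatenated train and test stories. A different approach, which is not what the notebook does, is to fit the vectorizer and the scaler on the training stories only and then apply them unchanged to the test stories. The sketch below illustrates that pattern on made-up sentences; all texts and variable names are assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import MinMaxScaler

# Hypothetical train and test texts.
train_texts = ["good day today", "bad day today", "good good night"]
test_texts = ["good night today", "unseen words only"]

# Learn the vocabulary from the training texts only.
vec = CountVectorizer()
X_train = vec.fit_transform(train_texts).toarray()
# transform() reuses that vocabulary; words never seen in training are ignored.
X_test = vec.transform(test_texts).toarray()

# Learn the per-feature min and max from the training rows only.
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(sorted(vec.vocabulary_))
print(X_test)
```

Both matrices end up with the same columns in the same order, so no row slicing by position is needed afterwards.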


### Save output to csv (optional)
After you're done, save your output to the 'classification_results.csv' file.<br/>
We assume that the dataframe with your results contains the following columns:
* column 1 (left column): 'test_example_id'  - the same id associated with each of the test stories to be predicted.
* column 2 (right column): 'predicted_category' - the predicted gender value for the associated story.

Assuming your predicted values are in the `df_predicted` dataframe (named `result` in this notebook), you should save your results as follows:

In [187]:
result.to_csv('classification_results.csv',index=False)
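
Before submitting, it can help to read the saved file back and confirm that it has exactly the two expected columns in the expected order. The sketch below does this with a small made-up results frame written to a temporary directory; the three rows and the temporary path are assumptions for illustration.

```python
import os
import tempfile

import pandas as pd

# Hypothetical results frame with the two required columns.
result = pd.DataFrame({
    "test_example_id": [0, 1, 2],
    "predicted_category": ['m', 'f', 'm'],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "classification_results.csv")
    # index=False keeps the row index out of the file.
    result.to_csv(path, index=False)
    check = pd.read_csv(path)

print(list(check.columns))
print(check['predicted_category'].isin(['m', 'f']).all())
```

The same read-back check works on the real file by pointing `pd.read_csv` at 'classification_results.csv'.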