# Assignment 5 - Text Analysis
An explanation this assignment could be found in the .pdf explanation document


## Materials to review for this assignment
<h4>From Moodle:</h4> 
<h5><u>Review the notebooks regarding the following python topics</u>:</h5>
<div class="alert alert-info">
&#x2714; <b>Working with strings</b> (tutorial notebook)<br/>
&#x2714; <b>Text Analysis</b> (tutorial notebook)<br/>
&#x2714; <b>Hebrew text analysis tools (tokenizer, wordnet)</b> (moodle example)<br/>
&#x2714; <b>(brief review) All previous notebooks</b><br/>
</div> 
<h5><u>Review the presentations regarding the following topics</u>:</h5>
<div class="alert alert-info">
&#x2714; <b>Text Analysis</b> (lecture presentation)<br/>
&#x2714; <b>(brief review) All other presentations</b><br/>
</div>

## Preceding Step - import modules (packages)
This step is necessary in order to use external modules (packages). <br/>

In [1]:
# --------------------------------------
import pandas as pd
import numpy as np
# --------------------------------------


# --------------------------------------
# ------------- visualizations:
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
# --------------------------------------


# ---------------------------------------
import sklearn
from sklearn import preprocessing, metrics, pipeline, model_selection, feature_extraction 
from sklearn import naive_bayes, linear_model, svm, neural_network, neighbors, tree
from sklearn import decomposition, cluster

from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV 
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.metrics import precision_score, recall_score, f1_score, make_scorer
from sklearn.metrics import mean_squared_error, r2_score, silhouette_score
from sklearn.preprocessing import MinMaxScaler, StandardScaler, LabelEncoder

from sklearn.feature_selection import SelectKBest, chi2 ,mutual_info_classif
from sklearn.model_selection import  cross_validate ,RepeatedKFold , RepeatedStratifiedKFold
from sklearn.ensemble import StackingClassifier , BaggingClassifier
from sklearn.svm import LinearSVC
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import Perceptron, SGDClassifier
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.naive_bayes import MultinomialNB, GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC 
# ---------------------------------------


# ----------------- output and visualizations: 
import warnings
from sklearn.exceptions import ConvergenceWarning
warnings.simplefilter("ignore")
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter("ignore", category=ConvergenceWarning)
# show several prints in one cell. This will allow us to condence every trick in one cell.
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
%matplotlib inline
pd.pandas.set_option('display.max_columns', None)
pd.set_option('display.float_format', lambda x: '%.3f' % x)
# ---------------------------------------

### Text analysis and String manipulation imports:

In [2]:
# --------------------------------------
# --------- Text analysis and Hebrew text analysis imports:
# vectorizers:
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# regular expressions:
import re
# --------------------------------------

### (optional) Hebrew text analysis - WordNet (for Hebrew)
Note: the WordNet is not a must

#### (optional) Only if you didn't install Wordnet (for Hebrew) use:

In [3]:
# word net installation:

# unmark if you want to use and need to install
#!pip install wn
#!python -m wn download omw-he:1.4

In [4]:
# word net import:

# unmark if you want to use:
#import wn

### (optional) Hebrew text analysis - hebrew_tokenizer (Tokenizer for Hebrew)
Note: the hebrew_tokenizer is not a must

#### (optional) Only if you didn't install hebrew_tokenizer use:

In [5]:
# Hebrew tokenizer installation:

# unmark if you want to use and need to install:
#!pip install hebrew_tokenizer

In [6]:
# Hebrew tokenizer import:

# unmark if you want to use:
#import hebrew_tokenizer as ht

### Reading input files
Reading input files for train annotated corpus (raw text data) corpus and for the test corpus

In [7]:
train_filename = 'annotated_corpus_for_train.csv'
test_filename  = 'corpus_for_test.csv'
df_train = pd.read_csv(train_filename, index_col=None, encoding='utf-8')
df_test  = pd.read_csv(test_filename, index_col=None, encoding='utf-8')

In [8]:
df_train.head(8)
df_train.shape
df_train.tail(60)

Unnamed: 0,story,gender
0,"כשחבר הזמין אותי לחול, לא באמת חשבתי שזה יקרה,...",m
1,לפני שהתגייסתי לצבא עשיתי כל מני מיונים ליחידו...,m
2,מאז שהתחילו הלימודים חלומו של כל סטודנט זה הפנ...,f
3,"כשהייתי ילד, מטוסים היה הדבר שהכי ריתק אותי. ב...",m
4,‏הייתי מדריכה בכפר נוער ומתאם הכפר היינו צריכי...,f
5,לפני כ3 חודשים טסתי לרומא למשך שבוע. טסתי במטו...,f
6,אני כבר שנתיים נשוי והשנה אני ואישתי סוף סוף י...,m
7,השנה התחלנו שיפוץ בדירה שלנו בתל אביב. הדירה ה...,f


(753, 2)

Unnamed: 0,story,gender
693,"שלום לכולם, בשנה הסוערת האחרונה בחרתי להציע לא...",m
694,בסמסטר קודם אני וכמה חברים ללימודים קבענו להיפ...,m
695,אחרי 7 שנים החלטתי והרגשתי שהגיע הזמן למסד את ...,m
696,"אני גולש קייט כבר 3 שנים, הספורט הזה תמיד עניי...",m
697,ביום שבו נולדתי התברר לצוות הרפואי שערך בדיקות...,m
698,אחרי הלימודים ביום שני נסעתי עם בת הזוג שלי בה...,m
699,לפני שנתיים כשהיתתי בן 24 נסעתי עם האוטו שלי ל...,m
700,בתחילת הקורנה באזור הסגר הראשון אני ובת הזוג ש...,m
701,"יום חמישי רגיל, תמיד מתחיל לעבור טיפה מאוחר יו...",m
702,"בגלל כל המצב של הקורונה בארץ, הרבה זמן היינו ב...",m


In [9]:
df_test.head(3)
df_test.shape

df_test

Unnamed: 0,test_example_id,story
0,0,כל קיץ אני והמשפחה נוסעים לארצות הברית לוס אנג...
1,1,"הגעתי לשירות המדינה אחרי שנתיים כפעיל בתנועת ""..."
2,2,אחת האהבות הגדולות שלי אלו הכלבים שלי ושל אישת...


(323, 2)

Unnamed: 0,test_example_id,story
0,0,כל קיץ אני והמשפחה נוסעים לארצות הברית לוס אנג...
1,1,"הגעתי לשירות המדינה אחרי שנתיים כפעיל בתנועת ""..."
2,2,אחת האהבות הגדולות שלי אלו הכלבים שלי ושל אישת...
3,3,"רגע הגיוס לצבא היה הרגע הכי משמעותי עבורי, אני..."
4,4,אני הגעתי לברזיל ישר מקולומביה וגם אני עשיתי ע...
...,...,...
318,318,"בשנה האחרונה הרגשתי די תקוע בעבודה, השגרה הפכה..."
319,319,אני ואילן חברים טובים מזה 20 שנה תמיד חלמנו לפ...
320,320,מידי יום שישי אני נוהג לנסוע בתחבורה ציבורית ס...
321,321,"לפני מספר חודשים, בשיא התחלואה של הגל השני, עמ..."


### Your implementation:
Write your code solution in the following code-cells

Students Names: Tal Damari and Adar Azulay.


# ---------------------------------------------------------------------------------------------

df_clean_text function is cleaning our text from numeric numbers, dots etc.
custom_preprocessor is removing non hebrew characters

In [10]:
def df_clean_text(df_series):
    for indx in df_series.index:
        df_series["story"][indx] = re.sub(r'\d+', '', df_series["story"][indx])
        df_series["story"][indx] = re.sub(r'[^\w\s]', '', df_series["story"][indx])
        df_series["story"][indx] = re.sub(r'\s+', ' ', df_series["story"][indx])
        df_series["story"][indx] = df_series["story"][indx].strip()
    return df_series

In [11]:
df_train = df_clean_text(df_train)
df_test = df_clean_text(df_test)

# ---------------------------------------------------------------------------------------------

The next function is making train test split to our data set, after we "cleaned" it 

In [12]:
def train_test_split_func(df):
    X_train, X_test, y_train, y_test = train_test_split(df["story"], df["gender"], test_size=0.2, random_state=42)
    return X_train, X_test, y_train, y_test

In [13]:
X_train, X_test, y_train, y_test = train_test_split_func(df_train)

# --------------------------------------------------------------------------------------------- 

Creating a function that makes a preprocessing to our dataset that makes it to be a TF-IDF Vectorizer

In [14]:
def TfidfVectorizer_func(X_train, X_test, ngram_range = (1,1), min_df = 5):
    #df_train_copy = df_train["story"]
    count_idf = TfidfVectorizer(min_df = min_df, ngram_range=ngram_range) #creating an object idfvectorizer
    X_train_idf = count_idf.fit_transform(X_train)
    X_test_idf = count_idf.transform(X_test)
    return X_train_idf, X_test_idf # return the new dataset with the idf values for each word

# --------------------------------------------------------------------------------------------- 

Preform a selector for our dataset that will make feature selection in our dataset 

In [15]:
def select_best_features(X_train, X_test, y_train, k=1000):
    select_best = SelectKBest(mutual_info_classif, k=k)
    select_best.fit(X_train, y_train)
    
    X_train_select = select_best.transform(X_train)
    X_test_select = select_best.transform(X_test)
    return X_train_select, X_test_select

# --------------------------------------------------------------------------------------------- 

The next function is using MinMaxScaler, to scale our dataset:

In [16]:
def preform_MinMaxScale(X_train_select, X_test_select):
    scale = MinMaxScaler()
    X_train_scale = scale.fit_transform(X_train_select.toarray())
    X_test_scale = scale.fit_transform(X_test_select.toarray())
    return X_train_scale, X_test_scale 

In [17]:
X_train_idf, X_test_idf = TfidfVectorizer_func(X_train, X_test)
X_train_select, X_test_select = select_best_features(X_train_idf, X_test_idf, y_train)
X_train_scale, X_test_scale = preform_MinMaxScale(X_train_select, X_test_select)

# ---------------------------------------------------------------------------------------------

Finding the best parameters for our model using GridSearchCV function, they will be implemented in different functions:

In [18]:
def find_best_params_grid_search(model_clf, model_parameters, X_train, y_train):
    grid_search = GridSearchCV(model_clf, model_parameters, cv = 5, n_jobs=-1)
    grid_search.fit(X_train, y_train)
    return grid_search.best_params_

Implementation to get the best parameters for each model we will use:

In [19]:
Linear_svc_parmas = [{'C': [0.01, 0.1, 1, 10, 100], 'penalty': [None, 'l1', 'l2'], 'dual': [False]}]
best_params_linear_svc = find_best_params_grid_search(LinearSVC(), Linear_svc_parmas, X_train_scale, y_train)
best_params_linear_svc

Perceptron_params = [{'alpha': [0.0001, 0.05], 'penalty': [None, 'l2', 'l1', 'elasticnet']}]
best_params_perceptron = find_best_params_grid_search(Perceptron(), Perceptron_params, X_train_scale, y_train)
best_params_perceptron

MultinomailNB_parmas = [{'alpha': [0.01, 0.1, 0.5, 1], 'fit_prior': [True, False], 'class_prior': [None, [0.5, 0.5], [0.3, 0.7]]}]
best_params_MultinomialNB = find_best_params_grid_search(MultinomialNB(), MultinomailNB_parmas, X_train_scale, y_train)
best_params_MultinomialNB

MLP_params = [{'hidden_layer_sizes': [(50,50,50), (50,100,50), (100,)], 'activation': ['tanh', 'relu'],
               'solver': ['sgd', 'adam'], 'alpha': [0.0001, 0.05], 'learning_rate': ['constant','adaptive']}]
best_params_MLP = find_best_params_grid_search(MLPClassifier(),MLP_params, X_train_scale, y_train)
best_params_MLP

SGD_parmas = [{'loss': ['hinge', 'log', 'modified_huber', 'squared_hinge', 'perceptron'],
               'penalty': ['l2', 'l1', 'elasticnet'],
               'alpha': [0.0001, 0.05]}]
best_params_SGD = find_best_params_grid_search(SGDClassifier(), SGD_parmas, X_train_scale, y_train)
best_params_SGD

{'C': 1, 'dual': False, 'penalty': 'l2'}

{'alpha': 0.0001, 'penalty': None}

{'alpha': 1, 'class_prior': None, 'fit_prior': True}

{'activation': 'tanh',
 'alpha': 0.05,
 'hidden_layer_sizes': (100,),
 'learning_rate': 'adaptive',
 'solver': 'adam'}

{'alpha': 0.0001, 'loss': 'squared_hinge', 'penalty': 'elasticnet'}

# --------------------------------------------------------------------------------------------- 

**Setting the models were going to use with the best parameters the grid search has given us**

In [20]:
Linear_model = LinearSVC(**best_params_linear_svc)  #linear model
Perceptron_model = Perceptron(**best_params_perceptron) ##perceptron model
MultinomialNB_model = MultinomialNB(**best_params_MultinomialNB) #Multinomial naive bayse model
MLP_model = MLPClassifier(**best_params_MLP) #MLP model 
SGD_model = SGDClassifier(**best_params_SGD) #SGD model

# ---------------------------------------------------------------------------------------------

The next function is fitting and predicting the values that the model is giving us, and returning us the model with the f1_score as being asked f1_average:

In [21]:
def fit_predict_evaluate(model_name, X_train, X_test, y_train, y_test):
    model_trained = model_name.fit(X_train, y_train) #fitting our model with the train values
    
    cv = RepeatedKFold(n_splits= 5, n_repeats=3, random_state=42) #using different amount of folds for cross_vald_score function
    f1_scorer = make_scorer(f1_score, average='micro')
    scores = cross_val_score(model_trained, X_test, y_test, scoring = f1_scorer, cv = cv, n_jobs=-1)
    
    scores 
    
    y_pred = model_trained.predict(X_test)
    print(y_pred)
    
    f1_male = f1_score(y_test, y_pred, pos_label="m")
    f1_female = f1_score(y_test, y_pred, pos_label="f")
    f1_average = (f1_male+f1_female)/2
    
    return model_trained, f1_average

**Next cell is using the above cell, and showing our model predictions correctioness using f1_score_average as being asked:**

In [22]:
Linear_model
linear_trained, f1_average_linear = fit_predict_evaluate(Linear_model, X_train_scale, X_test_scale, y_train, y_test)
print("f1_score Linear_svc:")
print(f1_average_linear)

Perceptron_model
perceptorn_trained, f1_average_perceptron = fit_predict_evaluate(Perceptron_model, X_train_scale, X_test_scale, y_train, y_test)
print("f1_score perceptron:")
print(f1_average_perceptron)

MultinomialNB_model
MultinomialNB_trained, f1_average_MultinomialNB = fit_predict_evaluate(MultinomialNB_model, X_train_scale, X_test_scale, y_train, y_test)
print("f1_score MultinomialNB:")
print(f1_average_MultinomialNB)

MLP_model
MLP_trained, f1_average_MLP = fit_predict_evaluate(MLP_model, X_train_scale, X_test_scale, y_train, y_test)
print("f1_score MLP:")
print(f1_average_MLP)

SGD_model
SGD_trained, f1_average_SGD = fit_predict_evaluate(SGD_model, X_train_scale, X_test_scale, y_train, y_test)
print("f1_score SGD:")
print(f1_average_SGD)


LinearSVC(C=1, dual=False)

['m' 'm' 'm' 'm' 'f' 'm' 'm' 'f' 'f' 'f' 'm' 'f' 'm' 'm' 'm' 'f' 'm' 'm'
 'f' 'm' 'm' 'm' 'f' 'm' 'm' 'm' 'f' 'm' 'm' 'f' 'f' 'm' 'm' 'm' 'm' 'm'
 'f' 'm' 'm' 'f' 'm' 'm' 'm' 'm' 'm' 'm' 'm' 'm' 'f' 'm' 'f' 'f' 'm' 'm'
 'm' 'm' 'm' 'm' 'f' 'm' 'm' 'm' 'm' 'm' 'f' 'f' 'm' 'm' 'm' 'm' 'm' 'f'
 'm' 'm' 'm' 'm' 'm' 'f' 'f' 'm' 'f' 'f' 'm' 'm' 'm' 'm' 'm' 'm' 'm' 'm'
 'm' 'm' 'm' 'm' 'm' 'f' 'm' 'm' 'f' 'f' 'm' 'm' 'm' 'm' 'm' 'm' 'f' 'm'
 'f' 'f' 'f' 'm' 'f' 'm' 'm' 'm' 'f' 'm' 'm' 'm' 'm' 'm' 'm' 'm' 'm' 'f'
 'm' 'm' 'm' 'm' 'm' 'm' 'm' 'm' 'm' 'm' 'm' 'm' 'm' 'm' 'm' 'm' 'm' 'm'
 'm' 'm' 'm' 'm' 'm' 'm' 'f']
f1_score Linear_svc:
0.7082125603864735


Perceptron()

['m' 'm' 'm' 'm' 'm' 'm' 'm' 'f' 'f' 'm' 'f' 'f' 'm' 'm' 'm' 'f' 'm' 'm'
 'f' 'm' 'm' 'm' 'f' 'm' 'm' 'm' 'f' 'm' 'm' 'f' 'f' 'm' 'm' 'm' 'm' 'm'
 'm' 'm' 'm' 'f' 'f' 'm' 'm' 'm' 'm' 'm' 'm' 'm' 'f' 'm' 'f' 'f' 'm' 'm'
 'm' 'm' 'm' 'm' 'f' 'm' 'm' 'm' 'm' 'm' 'm' 'f' 'm' 'm' 'm' 'm' 'm' 'f'
 'm' 'm' 'm' 'm' 'f' 'f' 'f' 'm' 'f' 'f' 'm' 'm' 'm' 'm' 'm' 'm' 'm' 'm'
 'm' 'f' 'm' 'm' 'm' 'f' 'm' 'm' 'f' 'f' 'm' 'f' 'm' 'm' 'm' 'f' 'f' 'm'
 'f' 'm' 'f' 'm' 'f' 'm' 'm' 'm' 'f' 'm' 'm' 'm' 'm' 'm' 'm' 'm' 'm' 'f'
 'm' 'f' 'm' 'm' 'm' 'm' 'm' 'm' 'm' 'm' 'm' 'm' 'f' 'm' 'm' 'm' 'm' 'm'
 'm' 'm' 'm' 'm' 'f' 'm' 'f']
f1_score perceptron:
0.7010945505356312


MultinomialNB(alpha=1)

['m' 'm' 'm' 'm' 'm' 'm' 'm' 'f' 'm' 'f' 'm' 'f' 'm' 'm' 'm' 'm' 'm' 'm'
 'f' 'm' 'm' 'm' 'f' 'm' 'm' 'm' 'f' 'm' 'f' 'f' 'm' 'm' 'm' 'm' 'm' 'm'
 'm' 'm' 'm' 'f' 'm' 'm' 'm' 'm' 'm' 'm' 'm' 'm' 'f' 'm' 'm' 'm' 'm' 'm'
 'm' 'm' 'm' 'm' 'f' 'm' 'm' 'm' 'm' 'm' 'm' 'f' 'm' 'm' 'm' 'm' 'm' 'm'
 'm' 'm' 'm' 'f' 'm' 'm' 'f' 'm' 'f' 'm' 'm' 'm' 'm' 'm' 'm' 'm' 'm' 'm'
 'm' 'm' 'm' 'm' 'm' 'f' 'm' 'f' 'f' 'f' 'm' 'm' 'm' 'm' 'm' 'm' 'f' 'm'
 'f' 'm' 'm' 'm' 'm' 'm' 'm' 'm' 'f' 'm' 'm' 'm' 'm' 'm' 'm' 'm' 'm' 'm'
 'm' 'm' 'm' 'm' 'm' 'm' 'm' 'm' 'm' 'm' 'm' 'm' 'm' 'm' 'm' 'm' 'm' 'm'
 'm' 'm' 'm' 'm' 'm' 'm' 'm']
f1_score MultinomialNB:
0.694566506242589


MLPClassifier(activation='tanh', alpha=0.05, learning_rate='adaptive')

['m' 'm' 'm' 'm' 'm' 'm' 'm' 'f' 'm' 'f' 'm' 'f' 'm' 'm' 'm' 'm' 'm' 'm'
 'f' 'm' 'm' 'm' 'f' 'm' 'm' 'm' 'f' 'm' 'm' 'f' 'f' 'm' 'm' 'm' 'm' 'm'
 'f' 'm' 'm' 'm' 'm' 'm' 'm' 'm' 'm' 'm' 'm' 'm' 'f' 'm' 'm' 'f' 'm' 'm'
 'm' 'm' 'm' 'm' 'f' 'm' 'm' 'm' 'm' 'm' 'm' 'f' 'm' 'm' 'm' 'm' 'm' 'm'
 'm' 'm' 'm' 'm' 'm' 'm' 'f' 'm' 'f' 'f' 'm' 'm' 'm' 'm' 'm' 'm' 'm' 'm'
 'm' 'm' 'm' 'm' 'm' 'f' 'm' 'm' 'f' 'f' 'm' 'm' 'm' 'm' 'm' 'm' 'f' 'm'
 'f' 'm' 'm' 'm' 'f' 'm' 'm' 'm' 'f' 'm' 'm' 'm' 'm' 'm' 'm' 'm' 'm' 'f'
 'm' 'm' 'm' 'm' 'm' 'm' 'm' 'm' 'm' 'm' 'm' 'm' 'm' 'm' 'm' 'm' 'm' 'm'
 'm' 'm' 'm' 'm' 'm' 'm' 'f']
f1_score MLP:
0.6955645161290323


SGDClassifier(loss='squared_hinge', penalty='elasticnet')

['m' 'm' 'm' 'f' 'f' 'm' 'm' 'f' 'f' 'f' 'm' 'f' 'm' 'm' 'm' 'f' 'm' 'm'
 'f' 'm' 'm' 'm' 'f' 'm' 'm' 'm' 'f' 'm' 'm' 'f' 'f' 'm' 'm' 'm' 'm' 'm'
 'f' 'm' 'm' 'm' 'm' 'm' 'm' 'm' 'm' 'm' 'm' 'm' 'f' 'm' 'f' 'f' 'm' 'm'
 'm' 'm' 'm' 'm' 'f' 'm' 'm' 'm' 'm' 'm' 'f' 'f' 'm' 'm' 'f' 'm' 'f' 'f'
 'f' 'm' 'm' 'm' 'f' 'm' 'f' 'f' 'f' 'f' 'm' 'm' 'm' 'm' 'm' 'm' 'm' 'm'
 'm' 'f' 'm' 'm' 'm' 'f' 'm' 'f' 'f' 'f' 'm' 'm' 'm' 'm' 'm' 'f' 'f' 'm'
 'f' 'f' 'f' 'm' 'f' 'm' 'm' 'm' 'f' 'm' 'm' 'm' 'm' 'f' 'f' 'm' 'm' 'f'
 'm' 'f' 'm' 'm' 'm' 'm' 'm' 'f' 'm' 'm' 'm' 'm' 'm' 'f' 'f' 'f' 'm' 'm'
 'm' 'm' 'm' 'm' 'f' 'm' 'm']
f1_score SGD:
0.7073643410852712


# ---------------------------------------------------------------------------------------------

Next cells are for the df_test to predict and to finish the Assignment :-)

In [23]:
def TfidfVectorizer_func_test(X, ngram_range = (1,1), min_df = 5):
    count_idf = TfidfVectorizer(min_df = min_df, ngram_range=ngram_range) #creating an object idfvectorizer
    X = count_idf.fit_transform(X)
    return X # return the new dataset with the idf values for each word

In [24]:
def select_best_features_test(X, k=1000):
    select_best = SelectKBest(mutual_info_classif, k=k)
    X_test_selected = select_best.fit_transform(X, np.zeros(X.shape[0]))
    return X_test_selected

In [25]:
def preform_MinMaxScale_test(X_train_select):
    scale = MinMaxScaler()
    X_train_scale = scale.fit_transform(X_train_select.toarray())
    return X_train_scale

In [26]:
X_df_test = df_test["story"]  #X_df_test has only the stories 
X_df_test = TfidfVectorizer_func_test(X_df_test) #using TF - IDF vectorizer
X_df_test = select_best_features_test(X_df_test) #feature selector for the k best features of the dataset
X_df_test = preform_MinMaxScale_test(X_df_test) #minmax scaling 

y_prediction_test = SGD_trained.predict(X_df_test)  #predicting on the df_test
y_prediction_test #printing the predictions 

df_text_exmp = df_test.test_example_id #creating an array with the text example id column in df test

df_predicted = pd.DataFrame({"test_example_id":df_text_exmp.tolist(),
                     "predicted_category": y_prediction_test.tolist()}) #creating a dataframe as wanted 

print("Test only:")
df_predicted

array(['m', 'm', 'm', 'm', 'm', 'm', 'm', 'f', 'm', 'm', 'f', 'f', 'm',
       'f', 'm', 'm', 'f', 'f', 'm', 'm', 'm', 'm', 'm', 'm', 'f', 'm',
       'm', 'm', 'm', 'm', 'm', 'm', 'm', 'm', 'm', 'm', 'm', 'm', 'm',
       'f', 'm', 'm', 'f', 'm', 'm', 'm', 'f', 'm', 'm', 'm', 'm', 'f',
       'm', 'm', 'f', 'm', 'm', 'm', 'f', 'm', 'm', 'm', 'f', 'f', 'm',
       'm', 'm', 'm', 'm', 'm', 'f', 'm', 'f', 'm', 'f', 'm', 'f', 'm',
       'm', 'm', 'm', 'm', 'm', 'm', 'm', 'm', 'm', 'm', 'm', 'm', 'm',
       'm', 'm', 'f', 'f', 'm', 'm', 'm', 'm', 'm', 'm', 'm', 'm', 'f',
       'm', 'f', 'm', 'm', 'm', 'f', 'f', 'f', 'm', 'm', 'm', 'm', 'f',
       'm', 'm', 'm', 'f', 'f', 'm', 'm', 'm', 'm', 'm', 'm', 'm', 'm',
       'm', 'm', 'm', 'm', 'm', 'm', 'm', 'm', 'm', 'm', 'm', 'm', 'm',
       'm', 'm', 'm', 'm', 'f', 'm', 'm', 'm', 'm', 'm', 'm', 'm', 'm',
       'f', 'm', 'm', 'm', 'm', 'f', 'm', 'm', 'm', 'm', 'm', 'm', 'm',
       'm', 'm', 'm', 'm', 'f', 'm', 'm', 'm', 'f', 'm', 'm', 'm

Test only:


Unnamed: 0,test_example_id,predicted_category
0,0,m
1,1,m
2,2,m
3,3,m
4,4,m
...,...,...
318,318,m
319,319,f
320,320,f
321,321,m


# ---------------------------------------------------------------------------------------------

### Save output to csv (optional)
After you're done save your output to the 'classification_results.csv' csv file.<br/>
We assume that the dataframe with your results contain the following columns:
* column 1 (left column): 'test_example_id'  - the same id associated to each of the test stories to be predicted.
* column 2 (right column): 'predicted_category' - the predicted gender value for each of the associated story. 

Assuming your predicted values are in the `df_predicted` dataframe, you should save you're results as following:

In [28]:
df_predicted.to_csv('classification_results.csv',index=False)