#### Importing Libraries

In [45]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.dummy import DummyClassifier
from sklearn.pipeline import Pipeline
from sklearn import metrics
from sklearn.naive_bayes import MultinomialNB

------------------

#### Question 1:  Assigning Class Labels. For this assignment, instead of classifying the language of the text, we are going to predict whether the text contains a concrete word or not. Create a new column named ‘CONCRETE’ in your dataframe. The values for this column should be: LOW: if value in MEAN column is <=4; HIGH: if value in MEAN column >4


In [46]:
english = pd.read_csv("CONcreTEXT_trial_EN.tsv", sep='\t')
italian = pd.read_csv("CONcreTEXT_trial_IT.tsv", sep='\t')
combined = pd.concat([english, italian])
combined["CONCRETE"] = np.where(combined['MEAN'] <= 4, 'LOW',  'HIGH')
combined

Unnamed: 0,TARGET,POS,INDEX,TEXT,MEAN,CONCRETE
0,achievement,N,3,"Bring up academic achievements , awards , and ...",3.06,LOW
1,achievement,N,9,"Please list people you have helped , your pers...",3.03,LOW
2,activate,V,1,Add activated carbon straight to your vodka .,3.83,LOW
3,activate,V,15,"Place sensors around your garden , and when a ...",5.51,HIGH
4,adventure,N,9,Look for a partner that shares your level of a...,2.03,LOW
...,...,...,...,...,...,...
95,verità,N,8,"In un modo o nell' altro , la verità viene sem...",2.53,LOW
96,viaggio,N,2,Organizza dei viaggi nel fine settimana quando...,5.03,HIGH
97,viaggio,N,6,Pesa le tue valigie prima del viaggio per evit...,4.84,HIGH
98,vista,N,6,è molto importante non perdere di vista la pro...,2.22,LOW
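The labeling rule can be checked on a small toy frame before applying it to the full data. This is a minimal sketch: the two frames and their `MEAN` values are made up for illustration, and `ignore_index=True` is an optional choice (not used in the cell above) that avoids duplicate row labels after concatenation.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the English and Italian frames (values are illustrative only).
toy_en = pd.DataFrame({"TARGET": ["achievement", "sensor"], "MEAN": [3.06, 5.51]})
toy_it = pd.DataFrame({"TARGET": ["vista"], "MEAN": [4.00]})

# ignore_index=True renumbers rows 0..n-1 instead of repeating each frame's index.
toy = pd.concat([toy_en, toy_it], ignore_index=True)

# MEAN <= 4 -> LOW, MEAN > 4 -> HIGH; the boundary value 4.00 falls in LOW.
toy["CONCRETE"] = np.where(toy["MEAN"] <= 4, "LOW", "HIGH")

labels = toy["CONCRETE"].tolist()
print(labels)
```

Including a row exactly at the boundary makes it easy to confirm which side of the threshold the value 4 lands on.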


------------------

#### Question 2:  Train Test split. You should use the sklearn library train_test_split function to create an 80% training set and 20% testing set. No need to create a validation/dev set for this assignment. 

In [47]:
combined_train, combined_test = train_test_split(combined, test_size = 0.2, random_state = 0)
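A quick way to sanity-check the split sizes is to run the same call on toy data. The sketch below is illustrative only: the frame is invented, and the `stratify` argument is an optional addition (not used in the cell above) that keeps the HIGH/LOW ratio the same in both parts.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 20 rows with a 15/5 class balance (made up for illustration).
toy = pd.DataFrame({
    "TEXT": [f"sentence {i}" for i in range(20)],
    "CONCRETE": ["HIGH"] * 15 + ["LOW"] * 5,
})

# 80/20 split; stratify preserves the class proportions in train and test.
train, test = train_test_split(
    toy, test_size=0.2, random_state=0, stratify=toy["CONCRETE"]
)

print(len(train), len(test))
print(test["CONCRETE"].value_counts().to_dict())
```

Fixing `random_state` makes the split reproducible, which matters when several models are compared on the same test set.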

------------------

#### Question 3: Majority Class Baseline. We will create a majority class baseline to evaluate our initial model performance – which is the simplest baseline. The label for each instance in the test data should simply be the majority class found in the training data. You should report P, R and F-score and accuracy of this majority baseline model on the test set.

In [48]:
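# Take the majority class from the training labels rather than hard-coding it.
majority_class = combined_train['CONCRETE'].mode()[0]
predicted = [majority_class] * len(combined_test)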
print(metrics.classification_report(combined_test['CONCRETE'],predicted))

              precision    recall  f1-score   support

        HIGH       0.62      1.00      0.77        25
         LOW       0.00      0.00      0.00        15

    accuracy                           0.62        40
   macro avg       0.31      0.50      0.38        40
weighted avg       0.39      0.62      0.48        40
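The same baseline can also be expressed with `DummyClassifier`, which is imported at the top of the notebook. This is a sketch on toy labels (invented for illustration); the real inputs would be the `CONCRETE` columns of the training and test frames. Passing `zero_division=0` to `classification_report` is one way to report 0.00 for a class that is never predicted without triggering the undefined-metric warning.

```python
from sklearn import metrics
from sklearn.dummy import DummyClassifier

# Toy labels standing in for the training and test CONCRETE columns.
y_train = ["HIGH", "HIGH", "LOW", "HIGH", "LOW"]
y_test = ["HIGH", "LOW", "HIGH"]

# The most_frequent strategy ignores the features, so placeholders suffice.
X_train = [[0]] * len(y_train)
X_test = [[0]] * len(y_test)

baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
predicted = baseline.predict(X_test)

# zero_division=0 sets undefined precision for the unpredicted class to 0.
print(metrics.classification_report(y_test, predicted, zero_division=0))
```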





------------------

#### Question 4: Target Length Baseline. We will create another baseline to evaluate our model performance – which takes into account the length of the word (in characters) in the ‘TARGET’ column. For this question, let us assume that all words with length >= 5 characters can be classified as HIGH CONCRETE (this assumption may not actually be true, but it is an assumption we are making for the purposes of creating a baseline). The label for each instance in the test data should simply be HIGH if the word in the ‘TARGET’ column has length >= 5 characters and LOW otherwise. You should report P, R and F-score and accuracy of this target length baseline model on the test set.


In [49]:
combined_train_target=np.where((combined_train['TARGET']).str.len() >=5, 'HIGH','LOW')
combined_test_target=np.where((combined_test['TARGET']).str.len() >=5, 'HIGH','LOW')
print(metrics.classification_report(combined_test['CONCRETE'],combined_test_target))

              precision    recall  f1-score   support

        HIGH       0.64      0.72      0.68        25
         LOW       0.42      0.33      0.37        15

    accuracy                           0.57        40
   macro avg       0.53      0.53      0.52        40
weighted avg       0.56      0.57      0.56        40
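Because the same rule is reused later with other thresholds, it may help to wrap it in a small function. The sketch below assumes a hypothetical helper name, `length_baseline`, which is not part of the notebook; the three example words are taken from the data shown earlier or invented for illustration.

```python
import pandas as pd

def length_baseline(targets: pd.Series, threshold: int = 5) -> list:
    """Label a target HIGH when it has at least `threshold` characters, else LOW."""
    return ["HIGH" if n >= threshold else "LOW" for n in targets.str.len()]

# 11, 5 and 4 characters respectively; the 5-character word sits on the boundary.
targets = pd.Series(["achievement", "vista", "edge"])
labels = length_baseline(targets)
print(labels)
```

Since the rule has no learned parameters, it only needs to be applied to the test targets; nothing is fitted on the training set.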



------------------

#### Question 5: Naive Bayes Classifier. You can follow the same steps as the previous assignment and use the built-in Naive Bayes model from sklearn to train a classifier for predicting whether the text is HIGH concrete or LOW concrete (instead of what is the language of the text, which was the classification task for the previous assignment).

In [50]:
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(combined_train["TEXT"])
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
clf = MultinomialNB().fit(X_train_tfidf, combined_train["CONCRETE"])

In [51]:
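# Caution: the model was fit on the TEXT column, but TARGET (one word per row) is vectorized here.
# The report below reflects TARGET inputs; combined_test["TEXT"] would match the training features.
X_new_counts = count_vect.transform(combined_test["TARGET"])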
X_new_tfidf = tfidf_transformer.transform(X_new_counts)
predicted = clf.predict(X_new_tfidf)

In [52]:
print(metrics.classification_report(combined_test['CONCRETE'],predicted))

              precision    recall  f1-score   support

        HIGH       0.62      1.00      0.77        25
         LOW       0.00      0.00      0.00        15

    accuracy                           0.62        40
   macro avg       0.31      0.50      0.38        40
weighted avg       0.39      0.62      0.48        40
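The three steps (counts, TF-IDF, Naive Bayes) can also be chained with `Pipeline`, which is imported at the top of the notebook. The sketch below is illustrative: the sentences and labels are invented, and the step names `counts`, `tfidf` and `nb` are arbitrary. A pipeline applies the same transformations to whatever column is passed to `fit` and `predict`, which makes it easier to keep training and test inputs consistent (here, full sentences in both cases).

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy sentences and labels, made up for illustration.
train_text = [
    "place the sensor in the garden",
    "weigh the suitcase on the scale",
    "truth always comes out",
    "share your sense of adventure",
]
train_labels = ["HIGH", "HIGH", "LOW", "LOW"]
test_text = ["put the sensor in the garden"]

nb_pipeline = Pipeline([
    ("counts", CountVectorizer()),   # raw token counts
    ("tfidf", TfidfTransformer()),   # reweight counts by TF-IDF
    ("nb", MultinomialNB()),         # multinomial Naive Bayes classifier
])

nb_pipeline.fit(train_text, train_labels)
predicted = nb_pipeline.predict(test_text)
print(list(predicted))
```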





------------------

#### Question 6: Comparing Performance. How does the Naive Bayes classifier perform, when compared against the performance of the two baselines developed in Questions 3 and 4? What are your observations? Write at least 50 words of your own original thoughts describing what you observe when comparing performance of these three models.

The Naive Bayes classifier does not outperform both baselines. Its report is identical to the majority class baseline: accuracy 0.62, an F1-score of 0.77 for HIGH, and 0.00 precision, recall and F1-score for LOW, which means it predicted HIGH for every test instance. Compared with the target length baseline, Naive Bayes has higher accuracy (0.62 vs 0.57) and higher HIGH recall (1.00 vs 0.72), but the target length baseline is the only model that recovers any LOW instances (recall 0.33), so it has the better macro-averaged F1-score (0.52 vs 0.38) and weighted F1-score (0.56 vs 0.48). The support column is not a performance measure; it only counts the test instances per class (25 HIGH, 15 LOW). One likely reason for the Naive Bayes result is that the model was trained on the TEXT column but the test features were built from the TARGET column, so each test instance is a single word and the prediction falls back on the more frequent class. With only 40 test instances, small differences between the models should also be read with caution.
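One way to put the three reports side by side is to compute accuracy and macro-averaged F1 for each model in a single pass. The sketch below does not use the notebook's actual prediction arrays; the predictions are rebuilt from the reported support and recall values (25 HIGH and 15 LOW in the test set, HIGH recall 0.72 and LOW recall 0.33 for the length baseline), which is an inference from the printed reports.

```python
from sklearn.metrics import accuracy_score, f1_score

# Gold labels: 25 HIGH followed by 15 LOW, matching the support column.
y_true = ["HIGH"] * 25 + ["LOW"] * 15

# Predictions reconstructed from the reports (an assumption, not the real arrays):
# majority and Naive Bayes label everything HIGH; the length baseline gets
# 18 of 25 HIGH and 5 of 15 LOW instances right.
predictions = {
    "majority": ["HIGH"] * 40,
    "target_length": ["HIGH"] * 18 + ["LOW"] * 7 + ["HIGH"] * 10 + ["LOW"] * 5,
    "naive_bayes": ["HIGH"] * 40,
}

summary = {}
for name, y_pred in predictions.items():
    acc = accuracy_score(y_true, y_pred)
    macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)
    summary[name] = (acc, macro_f1)
    print(f"{name:14s} accuracy={acc:.2f} macro_f1={macro_f1:.2f}")
```

Macro-averaged F1 weights both classes equally, so a model that never predicts LOW is penalized even when its accuracy looks competitive.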

------------------

#### Question 7: Experiment with at least 3 other thresholds of target length from Question 4. You should try setting various thresholds of target length to classify them as HIGH CONCRETE or LOW CONCRETE. For example, all words < 5 characters in length can be classified as low concrete. You should experiment with at least 3 different thresholds and document your reasons why you chose these thresholds. Report P, R and F-score and accuracy on the test set.

#### Threshold 1: Considering length as 7

In [53]:
combined_train_target=np.where((combined_train['TARGET']).str.len() >=7, 'HIGH','LOW')
combined_test_target=np.where((combined_test['TARGET']).str.len() >=7, 'HIGH','LOW')
print(metrics.classification_report(combined_test['CONCRETE'],combined_test_target))

              precision    recall  f1-score   support

        HIGH       0.50      0.36      0.42        25
         LOW       0.27      0.40      0.32        15

    accuracy                           0.38        40
   macro avg       0.39      0.38      0.37        40
weighted avg       0.41      0.38      0.38        40



#### Threshold 2: Considering length as 2

In [54]:
combined_train_target=np.where((combined_train['TARGET']).str.len() >=2, 'HIGH','LOW')
combined_test_target=np.where((combined_test['TARGET']).str.len() >=2, 'HIGH','LOW')
print(metrics.classification_report(combined_test['CONCRETE'],combined_test_target))

              precision    recall  f1-score   support

        HIGH       0.62      1.00      0.77        25
         LOW       0.00      0.00      0.00        15

    accuracy                           0.62        40
   macro avg       0.31      0.50      0.38        40
weighted avg       0.39      0.62      0.48        40



#### Threshold 3: Considering length as 15

In [55]:
combined_train_target=np.where((combined_train['TARGET']).str.len() >=15, 'HIGH','LOW')
combined_test_target=np.where((combined_test['TARGET']).str.len() >=15, 'HIGH','LOW')
print(metrics.classification_report(combined_test['CONCRETE'],combined_test_target))

              precision    recall  f1-score   support

        HIGH       0.00      0.00      0.00        25
         LOW       0.38      1.00      0.55        15

    accuracy                           0.38        40
   macro avg       0.19      0.50      0.27        40
weighted avg       0.14      0.38      0.20        40



#### Reason for choosing these thresholds

I chose these thresholds to check how precision, recall and F1-score behave for a low value (2), a medium value (7) and a high value (15). At the low threshold every test target is labeled HIGH, so the results match the majority class baseline: accuracy 0.62, with all LOW metrics at 0.00. At the high threshold every test target is labeled LOW, so all HIGH metrics are 0.00 and accuracy drops to 0.38. The medium threshold of 7 predicts both classes, but its accuracy is also 0.38 and its macro-averaged F1-score is 0.37, lower than the 0.52 obtained with the original threshold of 5 in Question 4.
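The threshold experiments repeat the same three lines with a different number, so they could be run in a single loop. This is a minimal sketch on a toy frame: the words and their labels are invented for illustration and do not reproduce the scores reported above.

```python
import pandas as pd
from sklearn.metrics import accuracy_score

# Toy targets with lengths 4, 5, 7 and 11; the labels are made up.
toy = pd.DataFrame({
    "TARGET": ["edge", "vista", "viaggio", "achievement"],
    "CONCRETE": ["HIGH", "LOW", "HIGH", "LOW"],
})
lengths = toy["TARGET"].str.len()

results = {}
for threshold in (2, 5, 7, 15):
    # HIGH when the target has at least `threshold` characters, else LOW.
    predicted = lengths.ge(threshold).map({True: "HIGH", False: "LOW"})
    results[threshold] = accuracy_score(toy["CONCRETE"], predicted)

print(results)
```

Thresholds below the shortest word or above the longest word collapse into constant predictions, which is why the extreme settings behave like single-class baselines.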