# Language Identification Hackathon

©  Explore Data Science Academy

---

### Honour Code

I ******, confirm - by submitting this document - that the solutions in this notebook are a result of our own work and that we abide by the [EDSA honour code](https://drive.google.com/file/d/1QDCjGZJ8-FmJE3bZdIQNwnJyQKPhHZBn/view?usp=sharing).

Non-compliance with the honour code constitutes a material breach of contract.

<img src="climate_change.jpg" width="800px">
## Language identification

### Overview

South Africa is a multicultural society that is characterised by its rich linguistic diversity. Language is an indispensable tool that can be used to deepen democracy and also contribute to the social, cultural, intellectual, economic and political life of the South African society.

The country is multilingual with 11 official languages, each of which is guaranteed equal status. Most South Africans are multilingual and able to speak two or more of the official languages.

Source: South African Government

<a id="cont"></a>

## Table of Contents

<a href=#one>1. Importing Packages</a>

<a href=#two>2. Loading Data</a>

<a href=#three>3. Exploratory Data Analysis (EDA)</a>

<a href=#four>4. Data Engineering</a>

<a href=#five>5. Modeling</a>

<a href=#six>6. Model Performance</a>

<a href=#seven>7. Model Explanations</a>

<a id="one"></a>
## 1. Importing Packages
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Importing Packages ⚡ |
| :--------------------------- |
| In this section the required packages are imported, and the libraries that will be used throughout the analysis and modelling are briefly discussed. |

In [None]:
# Standard library
import re
import warnings

# Libraries for importing and loading data
import numpy as np
import pandas as pd
import tensorflow as tf

# Libraries for data preparation
import nltk
from nltk.tokenize import word_tokenize, TreebankWordTokenizer
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

# Libraries for data visualizations
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

# Building classification models
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC, LinearSVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.linear_model import LogisticRegression

# Libraries for assessing model accuracy
from sklearn import metrics
from sklearn.metrics import (accuracy_score, roc_auc_score, recall_score,
                             precision_score, f1_score,
                             classification_report, confusion_matrix)

# Setting global constants to ensure notebook results are reproducible
RANDOM_STATE = 42

# Suppress warnings to keep the notebook output readable
warnings.filterwarnings('ignore')

<a id="two"></a>
## 2. Loading Data
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Loading the data ⚡ |
| :--------------------------- |
| In this section you are required to load the data from the `train_set.csv` and `test_set.csv` files into DataFrames. |

---

In [None]:
train_df = pd.read_csv('train_set.csv')
test_df = pd.read_csv('test_set.csv')

Let's get the number of rows and columns of the train and test datasets, and preview the first few rows of each.

In [None]:
print(train_df.shape)
print(test_df.shape)

display(train_df.head())
display(test_df.head())

In [None]:
type_labels = list(train_df.lang_id.unique())
print(type_labels)

###### Create a copy

The first step in the preprocessing is to create a copy of the train dataframe for the EDA.

In [None]:
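# Work on a copy so the EDA does not modify the original training data
eda_df = train_df.copy()

# Bar plot of label classes
fig, ax = plt.subplots()
eda_df['lang_id'].value_counts().plot(kind='bar', facecolor='g', alpha=0.65, ax=ax)
ax.set_xlabel('lang_id')
ax.set_ylabel('lang_id count')
plt.show()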

### Tokenisation

A tokeniser divides text into a sequence of tokens, which roughly correspond to "words" (see the [Stanford Tokeniser](https://nlp.stanford.edu/software/tokenizer.html)). We will use tokenisers to clean up the data, making it ready for analysis.
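
To make the idea concrete, here is a minimal regex-based tokeniser sketch. It is not the Stanford or NLTK Treebank tokeniser, only an approximation of what "splitting text into tokens" means; the function name `simple_tokenise` and the sample sentence are made up for illustration.

```python
import re


def simple_tokenise(text):
    """Split text into word tokens and single punctuation tokens."""
    # \w+ matches a run of letters, digits or underscores;
    # [^\w\s] matches one character that is neither a word character nor whitespace
    return re.findall(r"\w+|[^\w\s]", text)


sample = "Hello there, how are you?"
tokens = simple_tokenise(sample)
print(tokens)
```

Punctuation marks come out as separate tokens, which is the main difference from a plain `str.split()`.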

In [None]:
# Join with a space so that the last word of one row does not run into the first word of the next
' '.join(train_df['text'].tolist())[:2000]

In [None]:
import re

def tweet_cleaner(tweet):
    
    # Replace HTML tags with a space
    tweet = re.sub(r'<.*?>', ' ', tweet)
    
    # Replace numbers with a space
    tweet = re.sub(r'\d+', ' ', tweet)
    
    # Collapse repeated whitespace into single spaces
    tweet = ' '.join(tweet.split())
    
    return tweet


In [None]:
train_df['text'] = train_df['text'].apply(tweet_cleaner)

In [None]:
# Join with a space so that the last word of one row does not run into the first word of the next
' '.join(train_df['text'].tolist())[:1000]

The use of preprocessed data in Word Cloud makes it easy to identify the relevant words as opposed to many instances of https and other types of noise. 
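
The effect of this kind of cleaning on noisy text can be checked on a small made-up string. The sketch below applies the two substitutions discussed in this section (HTML tags and digits) and then tidies the whitespace; `clean_text` and the sample string are hypothetical and separate from the notebook's own function.

```python
import re


def clean_text(text):
    """Remove HTML tags and digits, then collapse repeated whitespace."""
    text = re.sub(r"<.*?>", " ", text)   # HTML tags -> space
    text = re.sub(r"\d+", " ", text)     # runs of digits -> space
    return " ".join(text.split())        # single spaces, no leading/trailing space


raw = "<p>Some text 2023</p>   with   noise"
cleaned = clean_text(raw)
print(cleaned)
```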

<a id="three"></a>
## 3. Exploratory Data Analysis (EDA)
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Exploratory data analysis ⚡ |
| :--------------------------- |
| In this section, you are required to perform an in-depth analysis of all the variables in the DataFrame. |

---


<a id="five"></a>
## 5. Modeling
<a href=#cont>Back to Table of Contents</a>

For modeling, we will convert the dataframes to a vector form for feature extraction.

We will be using the TF-IDF vectorizer. A plain bag-of-words representation (`CountVectorizer`) uses only the raw word frequency, whereas TF-IDF also down-weights words that occur in many documents, so words that are distinctive for a document carry more weight.
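
For reference, the weighting can be written out as follows. This is the smoothed form that scikit-learn's `TfidfVectorizer` uses with its default settings, where $n$ is the number of documents, $\mathrm{tf}(t, d)$ is the count of term $t$ in document $d$, and $\mathrm{df}(t)$ is the number of documents containing $t$. Each document vector is then scaled to unit L2 norm.

$$
\text{tf-idf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t), \qquad \mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1
$$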

In [None]:
# Instantiate the TF-IDF vectorizer
tfidf = TfidfVectorizer(max_features=5000)

# Separate features and target variable
X = train_df['text']
y = train_df['lang_id']

## Building classification models

We will be making use of a pipeline to build our classification models. This pipeline will vectorize the text data before fitting it to our chosen model.

The following 5 models will be considered:

- Random forest
- Naive Bayes
- K nearest neighbors
- Logistic regression
- Linear SVC

#### Train - Validation split

Before we pass our data through our custom pipelines we have to split our train data into features and target variables. After this step we can split our train data into a train and a validation set. This will allow us to evaluate our model performance and choose the best model to use for our submission.

In [None]:
# Split the train data into train and validation (20%) sets
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=RANDOM_STATE)

In [None]:
X_train
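
What the split above does can be sketched without scikit-learn: shuffle the row positions with a fixed seed, then hold out 20% of them for validation. This is an illustrative stand-in, not `train_test_split` itself; `holdout_split` and the toy data are hypothetical, and the shuffle order will differ from scikit-learn's.

```python
import random


def holdout_split(features, labels, valid_fraction=0.2, seed=42):
    """Shuffle row positions and split them into train and validation parts."""
    indices = list(range(len(features)))
    random.Random(seed).shuffle(indices)          # fixed seed -> same split every run
    n_valid = int(round(len(indices) * valid_fraction))
    valid_idx, train_idx = indices[:n_valid], indices[n_valid:]

    def pick(seq, idx):
        return [seq[i] for i in idx]

    return (pick(features, train_idx), pick(features, valid_idx),
            pick(labels, train_idx), pick(labels, valid_idx))


texts = [f"sentence {i}" for i in range(10)]
langs = ["eng" if i % 2 else "zul" for i in range(10)]
X_tr, X_va, y_tr, y_va = holdout_split(texts, langs)
print(len(X_tr), len(X_va))
```

Features and labels are selected with the same shuffled positions, so each text stays paired with its label.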

#### Pipelines

Pipelines consist of 2 steps, vectorization and model fitting.

Machines, unlike humans, cannot understand the raw text. Machines can only see numbers. Particularly, statistical techniques such as machine learning can only deal with numbers. Therefore, we need to convert our text into numbers.

The TF-IDF vectorizer assigns word frequency scores that try to highlight words that are more interesting, e.g. frequent in a document but not across documents. The `TfidfVectorizer` will tokenize documents, learn the vocabulary and inverse document frequency weightings, and allow you to encode new documents. Another advantage of this method is that the resulting vectors are already scaled (L2-normalised by default), so no separate scaling step is needed.
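
The steps described above (tokenise, learn a vocabulary, compute inverse document frequencies, encode documents) can be sketched in plain Python. This is a simplified illustration, not scikit-learn's implementation: it assumes whitespace tokenisation, the smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalisation. The helper names and toy documents are invented.

```python
import math
from collections import Counter


def fit_idf(docs):
    """Learn the vocabulary and a smoothed IDF weight for each term."""
    n = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc.lower().split()))   # count each term once per document
    return {term: math.log((1 + n) / (1 + df)) + 1 for term, df in doc_freq.items()}


def transform(doc, idf):
    """Encode one document as an L2-normalised {term: weight} mapping."""
    counts = Counter(t for t in doc.lower().split() if t in idf)  # unseen terms are ignored
    weights = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}


docs = ["the cat sat", "the dog sat", "the bird flew"]
idf = fit_idf(docs)
vec = transform("the cat sat", idf)
print(vec)
```

In the toy corpus, "the" appears in every document and so receives the lowest weight, while "cat" appears in only one and receives the highest.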

In [None]:
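# Random Forest Classifier
rf = Pipeline([('tfidf', TfidfVectorizer(max_features=5000)),
               ('clf', RandomForestClassifier(max_depth=5, n_estimators=100,
                                              random_state=RANDOM_STATE))
              ])


# Naive Bayes (default hyperparameters)
nb = Pipeline([('tfidf', TfidfVectorizer(max_features=5000)),
               ('clf', MultinomialNB())
              ])


# K-NN Classifier
knn = Pipeline([('tfidf', TfidfVectorizer(max_features=5000)),
                ('clf', KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2))
               ])


# Logistic Regression
lr = Pipeline([('tfidf', TfidfVectorizer(max_features=5000)),
               ('clf', LogisticRegression(C=1, class_weight='balanced', max_iter=1000))
              ])


# Linear SVC (default hyperparameters)
lsvc = Pipeline([('tfidf', TfidfVectorizer(max_features=5000)),
                 ('clf', LinearSVC(random_state=RANDOM_STATE))
                ])


#### Train the models

The models are trained by passing the train data through each custom pipeline. The trained models are then used to predict the classes for the validation data set.

In [None]:
# Train and predict: Random forest
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_valid)

# Train and predict: Naive Bayes
nb.fit(X_train, y_train)
y_pred_nb = nb.predict(X_valid)

# Train and predict: K-nearest neighbors
knn.fit(X_train, y_train)
y_pred_knn = knn.predict(X_valid)

# Train and predict: Logistic regression
lr.fit(X_train, y_train)
y_pred_lr = lr.predict(X_valid)

# Train and predict: Linear SVC
lsvc.fit(X_train, y_train)
y_pred_lsvc = lsvc.predict(X_valid)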


## Model evaluation

The performance of each model will be evaluated based on the precision, accuracy and F1 score achieved when the model is used to predict the classes for the validation data. We will be looking at the following to determine and visualize these metrics:

- Classification report
- Confusion matrix

The best model will be selected based on the weighted F1 score.
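
To show what the weighted F1 score measures, here is a small pure-Python sketch: per-class precision, recall and F1, followed by an average weighted by each class's support (its number of true samples). It is an illustration on made-up labels; in the notebook itself these values come from `classification_report`.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {label: F1} for every label that occurs in y_true."""
    scores = {}
    for label in sorted(set(y_true)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        scores[label] = 2 * precision * recall / denom if denom else 0.0
    return scores


def weighted_f1(y_true, y_pred):
    """Average the per-class F1 scores, weighted by class support."""
    support = Counter(y_true)
    scores = per_class_f1(y_true, y_pred)
    return sum(scores[label] * support[label] for label in scores) / len(y_true)


y_true = ["eng", "eng", "zul", "zul", "xho", "xho"]
y_pred = ["eng", "eng", "zul", "xho", "xho", "xho"]
class_scores = per_class_f1(y_true, y_pred)
overall = weighted_f1(y_true, y_pred)
print(class_scores)
print(round(overall, 3))
```

Weighting by support means that frequent classes influence the final score more than rare ones, which matters when the classes are imbalanced.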

In [None]:
# Generate a classification Report for the random forest model
print(metrics.classification_report(y_valid, y_pred_rf))



In [None]:
# Generate a classification Report for the K-nearest neighbors model
print(metrics.classification_report(y_valid, y_pred_knn))



In [None]:
# Generate a classification Report for the logistic regression model
print(metrics.classification_report(y_valid, y_pred_lr))



## Model Selection

Linear SVC has achieved the highest F1 score of 0.70 and is therefore our model of choice moving forward.

### Cleaning the test data

In [None]:
test_df['text'] = test_df['text'].apply(tweet_cleaner)
test_df.head()

### Generate the CSV file to submit to Kaggle

In [None]:
def gen_kaggle_csv(model, df):
    
    # Features of the unseen test data
    X_test = df['text']
    
    # Make a prediction on the test data with the trained model
    mypreds = model.predict(X_test)
    
    # Use the existing 'index' column as the row id if present,
    # otherwise fall back to the DataFrame index (df itself is left unchanged)
    index = df['index'] if 'index' in df.columns else df.index
    
    # Combine the row id and the predicted language into one DataFrame
    kaggle = pd.DataFrame({'index': index,
                           'lang_id': mypreds})
    
    # Write the submission file
    kaggle.to_csv('kaggle.csv', index=False)

    return kaggle

gen_kaggle_csv(rf, test_df)

### Pickle Trained Model

In [None]:
import pickle

def save_pickle_file(model, file_name):
    
    # Save the model to the specified path
    with open(file_name, 'wb') as file:
        pickle.dump(model, file)
    
    return file_name

save_pickle_file(lsvc, "lsvc_model.pkl")
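
A saved model is only useful if it can be loaded again. The sketch below round-trips an object through `pickle` using a temporary file. The dictionary is a stand-in for the trained pipeline, and `load_pickle_file` and the file name are hypothetical. Only unpickle files from a trusted source, since loading a pickle can execute arbitrary code.

```python
import os
import pickle
import tempfile


def load_pickle_file(file_name):
    """Load and return the object stored in a pickle file."""
    with open(file_name, "rb") as file:
        return pickle.load(file)


# Stand-in for a trained model; any picklable object behaves the same way
stand_in_model = {"name": "demo", "classes": ["eng", "zul"]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.pkl")
    with open(path, "wb") as file:
        pickle.dump(stand_in_model, file)
    restored = load_pickle_file(path)

print(restored == stand_in_model)
```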