<a id="cont"></a>

## Table of Contents

<a href=#one>1. Importing Packages</a>

<a href=#two>2. Loading the Data</a>

<a href=#three>3. Exploratory Data Analysis (EDA)</a>

<a href=#four>4. Data Engineering</a>

<a href=#five>5. Modelling</a>

<a href=#six>6. Model Performance</a>

<a href=#seven>7. Model Explanations</a>

<a id="one"></a>
## 1. Importing Packages
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Importing Packages ⚡ |
| :--------------------------- |
| In this section, I will be importing the libraries that will be used throughout my analysis and modelling. |

---

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import nltk
import re
import string
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings('ignore')

<a id="two"></a>
## 2. Loading the Data
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Loading the data ⚡ |
| :--------------------------- |
| In this section I will be loading the data from the `train_set.csv` and `test_set.csv` files into DataFrames. |

---

In [4]:
# Load the labelled training data (df) and the test data (df1)
df = pd.read_csv('train_set.csv')
df1 = pd.read_csv('test_set.csv')

<a id="three"></a>
## 3. Exploratory Data Analysis (EDA)
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Exploratory data analysis ⚡ |
| :--------------------------- |
| In this section, I will examine the variables in both DataFrames: their shape, missing values, class balance, and text lengths. |

---


In [None]:
# look at data statistics
df.shape

In [None]:
df1.shape

In [None]:
df.head(5)

In [None]:
df1.head(5)

In [None]:
df.isnull().sum()

In [None]:
df1.isnull().sum()

In [None]:
# plot relevant feature interactions

In [None]:
type_labels = list(df.lang_id.unique())
print(type_labels)

In [None]:
df['lang_id'].value_counts().plot(kind = 'bar')
plt.show()

In [None]:
df['lang_id'].value_counts()

In [None]:
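# Compare the character lengths of the training and test texts
length_train_set = df['text'].str.len()
length_test_set = df1['text'].str.len()
# alpha keeps both histograms visible where they overlap
plt.hist(length_train_set, bins=15, alpha=0.5, label='Train text')
plt.hist(length_test_set, bins=40, alpha=0.5, label='Test text')
plt.xlabel('Text length (characters)')
plt.ylabel('Number of texts')
plt.legend()
plt.show()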

In [None]:
df['text'].value_counts()

<a id="four"></a>
## 4. Data Engineering
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Data engineering ⚡ |
| :--------------------------- |
| In this section I will be cleaning the dataset and, where useful, creating new features. |

---

In [5]:
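# Bag-of-words features: learn the vocabulary from the training text only
vect = CountVectorizer()
vect.fit(df['text'])
new = vect.transform(df['text'])  # sparse document-term matrix for the training set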

In [6]:
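# Transform the test text with the vocabulary learned from the training set
new2 = vect.transform(df1['text'])
new3 = new2.toarray()  # dense copy of the test matrix, used for prediction below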
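
The vectorizer is fitted on the training text only and then reused to transform the test text, so both matrices share the same columns. A minimal sketch of that behaviour on made-up sentences (the toy texts and the `toy_` names are illustrative assumptions, not part of the dataset):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy sentences, used only to illustrate fit vs. transform
toy_train = ["die kat sit", "the cat sits", "die hond loop"]
toy_test = ["die kat loop vinnig"]  # 'vinnig' never appears in toy_train

toy_vect = CountVectorizer()
toy_train_counts = toy_vect.fit_transform(toy_train)  # learns the vocabulary
toy_test_counts = toy_vect.transform(toy_test)        # reuses that vocabulary

print(sorted(toy_vect.vocabulary_))
print(toy_train_counts.shape, toy_test_counts.shape)
print(toy_test_counts.sum())  # words unseen during fit are not counted
```

Because the test matrix has exactly the columns the model was trained on, words that only occur in the test text are silently dropped rather than causing a shape mismatch.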

<a id="five"></a>
## 5. Modelling
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Modelling ⚡ |
| :--------------------------- |
| In this section, I will create one or more models that predict the language (`lang_id`) of a text. |

---

In [8]:
X = new
y = df['lang_id']

In [9]:

# Hold out 10% of the labelled training data for validation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.10, random_state=50)
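
The split holds out 10% of the labelled rows at random. If the class proportions should be preserved in both parts, `train_test_split` also accepts a `stratify` argument. The sketch below shows its effect on made-up labels (the `toy_` data is an assumption for illustration; whether stratification is needed here depends on how balanced `lang_id` is):

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 60 samples, 20 per language label
toy_X = [[i] for i in range(60)]
toy_y = ['afr'] * 20 + ['tsn'] * 20 + ['ven'] * 20

X_tr, X_val, y_tr, y_val = train_test_split(
    toy_X, toy_y, test_size=0.10, random_state=50, stratify=toy_y
)

print(len(X_tr), len(X_val))
print(Counter(y_val))  # each label keeps its share of the held-out rows
```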

In [10]:
naive_model = MultinomialNB().fit(X_train , y_train)


In [11]:
ypred = naive_model.predict(new3)

In [12]:
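from sklearn.metrics import accuracy_score
# Score on the held-out split: y[:5682] holds training labels that do not correspond to the test rows in ypred (that mismatch gave 0.09)
print('Accuracy of Naive Classifier: {:.2f}'.format(accuracy_score(y_test, naive_model.predict(X_test))))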
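
An accuracy score is only meaningful when the predictions and the labels refer to the same rows. A minimal, self-contained sketch of scoring a Naive Bayes model on a held-out split follows; the toy corpus and all `toy_`/`val_` names are assumptions made for illustration, not the notebook's data:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus: 8 Afrikaans and 8 English sentences
afr_texts = [
    "die kat sit op die mat", "ek hou van my hond", "dit is 'n mooi dag",
    "ons gaan na die winkel", "die kind speel buite", "sy lees 'n boek",
    "hy werk in die tuin", "die son skyn vandag",
]
eng_texts = [
    "the cat sits on the mat", "i like my dog", "it is a nice day",
    "we are going to the shop", "the child plays outside", "she reads a book",
    "he works in the garden", "the sun shines today",
]
toy_texts = afr_texts + eng_texts
toy_labels = ['afr'] * 8 + ['eng'] * 8

# Split first, so the held-out rows stay unseen during fitting
txt_tr, txt_val, y_tr, y_val = train_test_split(
    toy_texts, toy_labels, test_size=0.25, random_state=50, stratify=toy_labels
)

toy_vect = CountVectorizer().fit(txt_tr)
toy_model = MultinomialNB().fit(toy_vect.transform(txt_tr), y_tr)

# Predictions and labels come from the same held-out rows
val_pred = toy_model.predict(toy_vect.transform(txt_val))
val_acc = accuracy_score(y_val, val_pred)
print('Validation accuracy: {:.2f}'.format(val_acc))
```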


In [13]:
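# Build the submission from the test-set index and the Naive Bayes predictions
output4 = df1[['index']].copy()  # copy so the new column is not written to a view of df1
output4['lang_id'] = ypred
output4.to_csv('logreg_final.csv', index=False)  # file name kept from the original run, although the model is Naive Bayes
output4.head()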

|   | index | lang_id |
| :--- | :--- | :--- |
| 0 | 1 | tsn |
| 1 | 2 | nbl |
| 2 | 3 | ven |
| 3 | 4 | ssw |
| 4 | 5 | afr |
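
Each model in this section writes its predictions to a CSV in the same way. One possible way to avoid repeating those lines is a small helper; `make_submission` is a hypothetical name introduced here, and the toy frame only mimics the `index` and `text` columns of the test set:

```python
import pandas as pd


def make_submission(test_df, predictions, path):
    """Write an (index, lang_id) CSV for the predictions and return the frame."""
    if len(test_df) != len(predictions):
        raise ValueError("predictions must have one entry per test row")
    submission = test_df[['index']].copy()  # copy avoids SettingWithCopyWarning
    submission['lang_id'] = list(predictions)
    submission.to_csv(path, index=False)
    return submission


# Usage on a hypothetical three-row test frame
toy_frame = pd.DataFrame({'index': [1, 2, 3], 'text': ['a', 'b', 'c']})
toy_out = make_submission(toy_frame, ['tsn', 'nbl', 'ven'], 'toy_submission.csv')
print(toy_out)
```

With such a helper, each model's export would reduce to a single call, for example `make_submission(df1, ypred, 'logreg_final.csv')`.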


In [None]:
output4.shape

In [None]:
tree = DecisionTreeClassifier(random_state=42)
tree.fit(X_train, y_train)
y_pred = tree.predict(new3)

In [None]:
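from sklearn.metrics import accuracy_score
# Score on the held-out split; y[:5682] holds training labels that do not correspond to the test rows in y_pred
print('Accuracy of tree Classifier: {:.2f}'.format(accuracy_score(y_test, tree.predict(X_test))))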

In [None]:
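# Build the submission from the test-set index and the decision tree predictions
output5 = df1[['index']].copy()  # copy so the new column is not written to a view of df1
output5['lang_id'] = y_pred
output5.to_csv('tree.csv', index=False)
output5.head()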

In [None]:
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)
pred_forest = forest.predict(new3)

In [None]:
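from sklearn.metrics import accuracy_score
# Score on the held-out split; y[:5682] holds training labels that do not correspond to the test rows in pred_forest
print('Accuracy of Random Forest Classifier: {:.2f}'.format(accuracy_score(y_test, forest.predict(X_test))))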

In [None]:
# Build the submission from the test-set index and the random forest predictions
output6 = df1[['index']].copy()  # copy so the new column is not written to a view of df1
output6['lang_id'] = pred_forest
output6.to_csv('forest.csv', index=False)
output6.head()

In [None]:
# The base estimator and algorithm are left at their defaults; recent scikit-learn releases no longer accept base_estimator or 'SAMME.R'
ada_boost = AdaBoostClassifier(n_estimators=50, learning_rate=1.0, random_state=42)
ada_boost.fit(X_train, y_train)
pred_ada = ada_boost.predict(new3)

In [None]:
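from sklearn.metrics import accuracy_score
# Score on the held-out split; y[:25000] holds training labels that do not correspond to the test rows in pred_ada
print('Accuracy of Ada Classifier: {:.2f}'.format(accuracy_score(y_test, ada_boost.predict(X_test))))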

In [None]:
# Build the submission from the test-set index and the AdaBoost predictions
output7 = df1[['index']].copy()  # copy so the new column is not written to a view of df1
output7['lang_id'] = pred_ada
output7.to_csv('ada.csv', index=False)
output7.head()
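
The four classifiers in this section can also be compared side by side on one held-out split. The sketch below is one way to do that, assuming a made-up two-language corpus; the `candidates` dictionary and every `toy_` name are illustrative and not part of the notebook:

```python
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy corpus: 8 Afrikaans and 8 English sentences
toy_texts = [
    "die kat sit op die mat", "ek hou van my hond", "dit is 'n mooi dag",
    "ons gaan na die winkel", "die kind speel buite", "sy lees 'n boek",
    "hy werk in die tuin", "die son skyn vandag",
    "the cat sits on the mat", "i like my dog", "it is a nice day",
    "we are going to the shop", "the child plays outside", "she reads a book",
    "he works in the garden", "the sun shines today",
]
toy_labels = ['afr'] * 8 + ['eng'] * 8

txt_tr, txt_val, y_tr, y_val = train_test_split(
    toy_texts, toy_labels, test_size=0.25, random_state=50, stratify=toy_labels
)
toy_vect = CountVectorizer().fit(txt_tr)
X_tr, X_val = toy_vect.transform(txt_tr), toy_vect.transform(txt_val)

# Same model families as in this section, with the same random seeds
candidates = {
    'naive_bayes': MultinomialNB(),
    'tree': DecisionTreeClassifier(random_state=42),
    'forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'ada_boost': AdaBoostClassifier(n_estimators=50, random_state=42),
}

scores = {}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    scores[name] = accuracy_score(y_val, model.predict(X_val))

# Print the models from best to worst validation accuracy
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print('{:<12} {:.2f}'.format(name, score))
```

Scores from a corpus this small say nothing about the real dataset; the point is only that every model is fitted and scored on the same rows.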