## Introduction
South Africa is a multilingual country with 11 official languages, and most of its citizens speak more than one of them. This project builds a model that takes text in any of the 11 official South African languages and predicts which language it is written in.

The libraries needed for this project are imported first.

In [17]:
# Importing relevant libraries
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB


# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/south-african-language-identification-hack-2022/sample_submission.csv
/kaggle/input/south-african-language-identification-hack-2022/test_set.csv
/kaggle/input/south-african-language-identification-hack-2022/train_set.csv


Because the data set is large, this analysis was performed directly on Kaggle, using the data hosted there.

In [None]:
#loading the data
#train data
df_train = pd.read_csv('/kaggle/input/south-african-language-identification-hack-2022/train_set.csv')
#test data
df_test = pd.read_csv('/kaggle/input/south-african-language-identification-hack-2022/test_set.csv')


Let's take a look at the data.

In [4]:
df_train.head()

Unnamed: 0,lang_id,text
0,xho,umgaqo-siseko wenza amalungiselelo kumaziko ax...
1,xho,i-dha iya kuba nobulumko bokubeka umsebenzi na...
2,eng,the province of kwazulu-natal department of tr...
3,nso,o netefatša gore o ba file dilo ka moka tše le...
4,ven,khomishini ya ndinganyiso ya mbeu yo ewa maana...


In [5]:
df_test.head()

Unnamed: 0,index,text
0,1,"Mmasepala, fa maemo a a kgethegileng a letlele..."
1,2,Uzakwaziswa ngokufaneleko nakungafuneka eminye...
2,3,Tshivhumbeo tshi fana na ngano dza vhathu.
3,4,Kube inja nelikati betingevakala kutsi titsini...
4,5,Winste op buitelandse valuta.


The test data does not have the lang_id column, because predictions for it are submitted to Kaggle for scoring.

Before building the model, the data needs some preprocessing to get it into the right shape.
**CountVectorizer** is used to transform the text data into numeric token counts suitable for modelling.
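
To make the idea of token counts concrete, here is a minimal standard-library sketch of a bag-of-words transform. It is a simplified illustration, not scikit-learn's implementation: the helper names (`tokenize`, `build_vocabulary`, `count_vector`) and the toy sentences are hypothetical, and only the token pattern (runs of two or more word characters) mirrors CountVectorizer's default.

```python
import re
from collections import Counter

# Same idea as CountVectorizer's default token pattern:
# runs of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return TOKEN_PATTERN.findall(text.lower())


def build_vocabulary(texts):
    """Map every distinct token to a column index, in sorted order."""
    tokens = sorted({tok for text in texts for tok in tokenize(text)})
    return {tok: idx for idx, tok in enumerate(tokens)}


def count_vector(text, vocabulary):
    """Return one row of token counts; tokens outside the vocabulary are ignored."""
    counts = Counter(tokenize(text))
    return [counts.get(tok, 0) for tok in vocabulary]


# Hypothetical toy sentences, not taken from the competition data.
toy_texts = ["die kat sit op die mat", "the cat sat on the mat"]
vocab = build_vocabulary(toy_texts)
rows = [count_vector(text, vocab) for text in toy_texts]

print(list(vocab))  # column names, one per distinct token
for row in rows:
    print(row)      # one count per column
```

Each text becomes one row and each distinct token one column, which is why the real feature matrix below is so wide and so full of zeros.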

In [19]:
#data preprocessing
#vectorising the data with at most 100000 features
vect = CountVectorizer(lowercase=True, max_features=100000)
#fitting the vectoriser to the train data
X_train = vect.fit_transform(df_train['text']) #creating train features
#convert to dataframe (get_feature_names() was removed in scikit-learn 1.2)
X_train = pd.DataFrame(X_train.toarray(), columns=vect.get_feature_names_out())
y_train = df_train['lang_id'] # creating train labels



In [7]:
#transforming test data with the vectoriser fitted on the train data
X_test = vect.transform(df_test['text']) #creating test features
X_test = pd.DataFrame(X_test.toarray(), columns=vect.get_feature_names_out()) #converting to DataFrame

In [9]:
#checking the test feature
X_test.head()

Unnamed: 0,aa,aabameli,aaent,aak,aan,aanbeveel,aanbeveling,aanbevelings,aanbevole,aanbied,...,ṱuvhana,ṱuwa,ṱuwe,ṱuwedza,ṱuwedzi,ṱuṱuwedza,ṱuṱuwedzaho,ṱuṱuwedze,ṱuṱuwedzea,ṱuṱuwedzwa
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [10]:
#checking the train feature
X_train.head()

Unnamed: 0,aa,aabameli,aaent,aak,aan,aanbeveel,aanbeveling,aanbevelings,aanbevole,aanbied,...,ṱuvhana,ṱuwa,ṱuwe,ṱuwedza,ṱuwedzi,ṱuṱuwedza,ṱuṱuwedzaho,ṱuṱuwedze,ṱuṱuwedzea,ṱuṱuwedzwa
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [14]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 33000 entries, 0 to 32999
Columns: 100000 entries, aa to ṱuṱuwedzwa
dtypes: int64(100000)
memory usage: 24.6 GB


In [15]:
X_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5682 entries, 0 to 5681
Columns: 100000 entries, aa to ṱuṱuwedzwa
dtypes: int64(100000)
memory usage: 4.2 GB
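
The memory figures reported above follow directly from storing every count as a dense int64 value. The short sketch below reproduces the arithmetic; it assumes the reported "GB" unit means 1024**3 bytes, which is consistent with both figures, and the helper name `dense_size_gib` is hypothetical.

```python
BYTES_PER_INT64 = 8


def dense_size_gib(n_rows, n_cols, bytes_per_value=BYTES_PER_INT64):
    """Size of a dense n_rows x n_cols array, in units of 1024**3 bytes."""
    return n_rows * n_cols * bytes_per_value / 1024**3


train_gib = dense_size_gib(33000, 100000)  # shape of X_train
test_gib = dense_size_gib(5682, 100000)    # shape of X_test
print(f"train: {train_gib:.1f}, test: {test_gib:.1f}")  # train: 24.6, test: 4.2
```

Since almost all entries are zero, one way to avoid this cost is to keep the sparse matrix returned by `fit_transform` instead of converting it to a DataFrame; the scikit-learn classifiers used in this notebook accept sparse input.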


In [16]:
#checking the unique languages in our data
y_train.unique()

array(['xho', 'eng', 'nso', 'ven', 'tsn', 'nbl', 'zul', 'ssw', 'tso',
       'sot', 'afr'], dtype=object)


To find the best model, the following classifiers were tried:
1. Logistic Regression
2. K-Nearest Neighbors
3. Decision Tree
4. Random Forest
5. Multinomial Naive Bayes
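
Candidate models can also be compared locally on a held-out split before anything is submitted. The following is a minimal sketch of that idea, not the procedure used in this notebook: the toy corpus, the 25% split, and the macro averaging are all assumptions made for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus standing in for df_train['text'] and df_train['lang_id'].
texts = [
    "die kat sit op die mat", "die hond slaap in die huis",
    "ek hou van die kat", "die huis is groot",
    "the cat sat on the mat", "the dog sleeps in the house",
    "we like the cat", "the house is big",
]
labels = ["afr"] * 4 + ["eng"] * 4

# Hold out a stratified quarter of the labelled data for validation.
train_texts, val_texts, y_tr, y_val = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0
)

# Fit the vectoriser on the training part only, then reuse it for validation.
vect = CountVectorizer(lowercase=True)
X_tr = vect.fit_transform(train_texts)  # kept sparse
X_val = vect.transform(val_texts)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "multinomial_nb": MultinomialNB(),
}

scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = f1_score(y_val, model.predict(X_val), average="macro")

print(scores)
```

A local score like this is only a rough guide, but it allows models to be ranked without spending a Kaggle submission on each one.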

In [11]:
lm = LogisticRegression() #creating the model

In [12]:
#fitting the data
lm.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


LogisticRegression()
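
The warning above means the solver stopped at its iteration limit before converging. One of the options it suggests is raising `max_iter`, whose default is 100. A minimal sketch on a toy corpus follows; the sentences and the value 1000 are illustrative assumptions, not tuned settings for this data set.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy corpus standing in for df_train['text'] and df_train['lang_id'].
texts = [
    "die kat sit op die mat", "die hond slaap",
    "the cat sat on the mat", "the dog sleeps",
]
labels = ["afr", "afr", "eng", "eng"]

vect = CountVectorizer(lowercase=True)
X = vect.fit_transform(texts)  # sparse matrix, passed to the model as-is

# Allow more solver iterations than the default of 100.
lm = LogisticRegression(max_iter=1000)
lm.fit(X, labels)

preds = lm.predict(vect.transform(["die kat slaap"]))
print(preds)
```

On the full data a larger limit makes fitting slower, so it is worth checking whether the warning actually disappears at the chosen value.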

In [None]:
#predicting the training data
y_pred = lm.predict(X_train)

In [None]:
#predicting the test data 
y_pred_test = lm.predict(X_test)

To test the predictive ability of this model, a submission file needs to be created and submitted to Kaggle.

In [None]:
# Creating submission file
submission = pd.DataFrame(df_test['index']) #getting the index column
submission['lang_id'] = y_pred_test #adding the predicted lang_id column
submission.to_csv('submission.csv', index=False) # saving the file as csv


**KNeighborsClassifier**

In [None]:
knn = KNeighborsClassifier() # creating the model
knn.fit(X_train, y_train) # fitting the data

In [None]:
y_hat = knn.predict(X_test) #predicting test data

In [None]:
# Creating submission file
submission = pd.DataFrame(df_test['index'])
submission['lang_id'] = y_hat
submission.to_csv('submission.csv', index=False)

**DecisionTreeClassifier**

In [None]:
tree = DecisionTreeClassifier(max_depth=5) #creating the model

In [None]:
tree.fit(X_train,y_train) #fitting the train data

In [None]:
tree.predict(X_train) #predicting the train data

In [None]:
y_tree_pred = tree.predict(X_test) #predicting the test data

In [None]:
# Creating submission file
submission = pd.DataFrame(df_test['index'])
submission['lang_id'] = y_tree_pred #using the decision tree predictions, not the KNN ones
submission.to_csv('submission.csv', index=False)

**RandomForestClassifier**

In [None]:
# Creating the model
rfc = RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1)

In [None]:
rfc.fit(X_train,y_train) # fitting the data
rfc.predict(X_train) # predicting the train data

In [None]:
y_rfc_pred = rfc.predict(X_test) #predicting the test data

**NaiveBayes(Multinomial)**

In [20]:
mn = MultinomialNB() #creating the model

In [None]:
mn.fit(X_train, y_train) #fitting the data

In [None]:
mn.predict(X_train) #predicting the train data

In [None]:
mn_y_pred = mn.predict(X_test) #predicting the test data

In [None]:
#Creating submission file
submission = pd.DataFrame(df_test['index'])
submission['lang_id'] = mn_y_pred
submission.to_csv('submission.csv', index=False)

Each of the models above was used to generate predictions, which were scored on Kaggle using the F1 score to find the best-performing model.
The best model was the NaiveBayes(Multinomial) model, with an F1 score of **0.9604**.
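
For readers unfamiliar with the metric, the sketch below computes an F1 score by hand for a multi-class problem. The averaging scheme Kaggle applied is not stated here, so the macro average (the unweighted mean of per-class F1) is an assumption, and the labels and predictions are hypothetical.

```python
def f1_for_label(y_true, y_pred, label):
    """F1 for one class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    if tp == 0:
        return 0.0  # no correct predictions for this class
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_label(y_true, y_pred, lab) for lab in labels) / len(labels)


# Hypothetical labels and predictions, for illustration only.
y_true = ["xho", "xho", "zul", "zul", "eng", "eng"]
y_pred = ["xho", "zul", "zul", "zul", "eng", "eng"]

print(round(macro_f1(y_true, y_pred), 4))  # 0.8222
```

Here one Xhosa text is mistaken for Zulu, which lowers recall for `xho` and precision for `zul` while leaving `eng` untouched.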