# South African Language Identification

## Table of Contents

1. Introduction
2. Problem Statement
3. Importing Packages
4. Loading the Data
5. Exploratory Data Analysis (EDA)
6. Data Engineering
7. Modelling
8. Model Explanation
9. Conclusion

## 1. Introduction

South Africa is a multicultural society with a rich linguistic heritage. Language is a vital tool for deepening
democracy and for enriching the social, cultural, intellectual, economic, and political life of the nation. Given
this linguistic diversity, our systems and devices need to support communication in multiple languages, reflecting
the inclusive nature of South African society.

## 2. Problem Statement

The objective of this task is to use Natural Language Processing (NLP) for language identification across South Africa's 11 official languages. Given a piece of text, the model should accurately determine which of these languages it is written in.
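
As a rough illustration of what language identification involves, the sketch below trains a character n-gram classifier on a handful of toy sentences. The sentences, their labels, and the pairing of `TfidfVectorizer` with `LogisticRegression` are assumptions made for illustration; this is not necessarily the model built later in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy sentences with assumed labels (illustrative only, not the competition data)
texts = [
    "you mark your ballot in private",
    "the closing date for the submission is friday",
    "hoekom moet onderhoud betaal word",
    "winste op buitelandse valuta",
]
labels = ["eng", "eng", "afr", "afr"]

# Character n-grams capture spelling patterns that differ between languages
model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 3)),
    LogisticRegression(max_iter=1000),
)
model.fit(texts, labels)

# Predict the language of two unseen sentences
predictions = model.predict(["please submit the form", "betaal die rekening"])
print(predictions)
```

With so few training sentences the predictions are not reliable; the point is only to show the shape of the task: raw text in, one language code out.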

## 3. Importing Packages

In [1]:
# Standard library
import re
from collections import Counter

# Data handling and plotting
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.sparse import hstack
from wordcloud import WordCloud

# Text processing
import nltk
from nltk.corpus import stopwords
from nltk.probability import FreqDist
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

# Modelling and evaluation
from sklearn import preprocessing
from sklearn.datasets import load_breast_cancer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Notebook display settings
%matplotlib inline
sns.set()

In [3]:
%pip install wordcloud

Note: you may need to restart the kernel to use updated packages.


## 4. Loading the Data

In [6]:
# Load the train and test data
df_train = pd.read_csv(r"C:\Users\Zinhle\Downloads\south-african-language-identification-hack-2023\train_set.csv")
df_test = pd.read_csv(r"C:\Users\Zinhle\Downloads\south-african-language-identification-hack-2023\test_set.csv")

df_train.head()

|   | lang_id | text |
|---|---|---|
| 0 | xho | umgaqo-siseko wenza amalungiselelo kumaziko ax... |
| 1 | xho | i-dha iya kuba nobulumko bokubeka umsebenzi na... |
| 2 | eng | the province of kwazulu-natal department of tr... |
| 3 | nso | o netefatša gore o ba file dilo ka moka tše le... |
| 4 | ven | khomishini ya ndinganyiso ya mbeu yo ewa maana... |
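
The absolute Windows paths used above only resolve on the author's machine. One way to make the loading step portable is to build the paths from a single configurable directory. The sketch below assumes a hypothetical `DATA_DIR` folder and a hypothetical `load_split` helper, neither of which is part of the original notebook. It writes a tiny stand-in CSV to a temporary folder so that it runs without the competition files.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_split(data_dir, filename):
    """Read one CSV split from data_dir, failing early with a clear message."""
    path = Path(data_dir) / filename
    if not path.exists():
        raise FileNotFoundError(f"Expected data file at {path}")
    return pd.read_csv(path)


# Stand-in for the real download folder (hypothetical; point this at your copy)
DATA_DIR = Path(tempfile.mkdtemp())
(DATA_DIR / "train_set.csv").write_text(
    "lang_id,text\neng,you mark your ballot in private\n", encoding="utf-8"
)

df_demo = load_split(DATA_DIR, "train_set.csv")
print(df_demo.shape)  # (1, 2)
```

Keeping the directory in one variable means only that line changes when the notebook moves to another machine.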


In [7]:
#Printing the test dataset
df_test

|   | index | text |
|---|---|---|
| 0 | 1 | Mmasepala, fa maemo a a kgethegileng a letlele... |
| 1 | 2 | Uzakwaziswa ngokufaneleko nakungafuneka eminye... |
| 2 | 3 | Tshivhumbeo tshi fana na ngano dza vhathu. |
| 3 | 4 | Kube inja nelikati betingevakala kutsi titsini... |
| 4 | 5 | Winste op buitelandse valuta. |
| ... | ... | ... |
| 5677 | 5678 | You mark your ballot in private. |
| 5678 | 5679 | Ge o ka kgetha ka bowena go se šomiše Mofani k... |
| 5679 | 5680 | E Ka kopo etsa kgetho ya hao ka hloko, hobane ... |
| 5680 | 5681 | TB ke bokudi ba PMB, mme Morero o tla lefella ... |


## 5. Exploratory Data Analysis (EDA)

### 5.1 Reading the train dataset

In [8]:
# Reload the train data and preview the first five rows
df_train = pd.read_csv(r"C:\Users\Zinhle\Downloads\south-african-language-identification-hack-2023\train_set.csv")
df_train.head()

|   | lang_id | text |
|---|---|---|
| 0 | xho | umgaqo-siseko wenza amalungiselelo kumaziko ax... |
| 1 | xho | i-dha iya kuba nobulumko bokubeka umsebenzi na... |
| 2 | eng | the province of kwazulu-natal department of tr... |
| 3 | nso | o netefatša gore o ba file dilo ka moka tše le... |
| 4 | ven | khomishini ya ndinganyiso ya mbeu yo ewa maana... |


### 5.2 Analysing the Data


In [9]:
# Looking at the shape of the train dataset.

df_train.shape

(33000, 2)

In [10]:
# Looking at the shape of the test dataset.

df_test.shape

(5682, 2)

In [11]:
# Check the column dtypes, non-null counts, and memory usage of the train dataset
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 33000 entries, 0 to 32999
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   lang_id  33000 non-null  object
 1   text     33000 non-null  object
dtypes: object(2)
memory usage: 515.8+ KB


In [12]:
# Check the column dtypes, non-null counts, and memory usage of the test dataset
df_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5682 entries, 0 to 5681
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   index   5682 non-null   int64 
 1   text    5682 non-null   object
dtypes: int64(1), object(1)
memory usage: 88.9+ KB


In [13]:
# List the columns available in the train dataset
df_train.columns

Index(['lang_id', 'text'], dtype='object')

In [14]:
# List the columns available in the test dataset
df_test.columns

Index(['index', 'text'], dtype='object')

In [15]:
#checking for nulls in the train data.
df_train.isnull().sum()

lang_id    0
text       0
dtype: int64

In [16]:
#checking for nulls in the test data.
df_test.isnull().sum()

index    0
text     0
dtype: int64
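
`isnull()` only flags missing values, so an empty or whitespace-only string would still count as non-null. The sketch below, run on a small made-up frame rather than the competition data, shows one way to check for such rows as well.

```python
import pandas as pd

# Made-up rows for illustration: one blank string and one genuinely missing value
df_demo = pd.DataFrame(
    {"lang_id": ["eng", "afr", "xho"], "text": ["hello world", "   ", None]}
)

# Missing values, as reported by isnull()
n_missing = df_demo["text"].isnull().sum()

# Non-missing entries that are empty once surrounding whitespace is stripped
n_blank = df_demo["text"].dropna().str.strip().eq("").sum()

print(n_missing, n_blank)  # 1 1
```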

In [17]:
# Show the last 5 rows of the train dataset.
df_train.tail()

|   | lang_id | text |
|---|---|---|
| 32995 | tsn | popo ya dipolateforomo tse ke go tlisa boetele... |
| 32996 | sot | modise mosadi na o ntse o sa utlwe hore thaban... |
| 32997 | eng | closing date for the submission of completed t... |
| 32998 | xho | nawuphina umntu ofunyenwe enetyala phantsi kwa... |
| 32999 | sot | mafapha a mang le ona a lokela ho etsa ditlale... |


In [18]:
# Show the last 5 rows of the test dataset.
df_test.tail()

|   | index | text |
|---|---|---|
| 5677 | 5678 | You mark your ballot in private. |
| 5678 | 5679 | Ge o ka kgetha ka bowena go se šomiše Mofani k... |
| 5679 | 5680 | E Ka kopo etsa kgetho ya hao ka hloko, hobane ... |
| 5680 | 5681 | TB ke bokudi ba PMB, mme Morero o tla lefella ... |
| 5681 | 5682 | Vakatjhela iwebhusayidi yethu ku-www. |


In [19]:
# Count unique values per column in the train dataset; fewer unique texts than rows points to duplicated texts.
df_train.nunique()

lang_id       11
text       29948
dtype: int64

In [20]:
# Count unique values per column in the test dataset; fewer unique texts than rows points to duplicated texts.
df_test.nunique()

index    5682
text     5459
dtype: int64
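
`nunique()` reports the number of distinct values per column, so duplicates show up only indirectly as the gap between the row count and the unique count (33000 rows against 29948 unique texts in the train set, 5682 against 5459 in the test set). A more direct check is sketched below on a made-up frame. `duplicated` and `drop_duplicates` are standard pandas methods, while the data is purely illustrative.

```python
import pandas as pd

# Made-up rows for illustration; the second and third texts are identical
df_demo = pd.DataFrame(
    {
        "lang_id": ["eng", "afr", "afr", "xho"],
        "text": ["good morning", "goeie more", "goeie more", "molo"],
    }
)

# Rows whose text repeats an earlier row
n_duplicates = df_demo.duplicated(subset="text").sum()

# Keep only the first occurrence of each text
df_unique = df_demo.drop_duplicates(subset="text").reset_index(drop=True)

print(n_duplicates, len(df_unique))  # 1 3
```

Whether repeated sentences should be dropped before training is a modelling decision; the sketch only shows how to measure and remove them.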

In [21]:
#Statistics summary of all train data
df_train.describe(include='all').T

|   | count | unique | top | freq |
|---|---|---|---|---|
| lang_id | 33000 | 11 | xho | 3000 |
| text | 33000 | 29948 | ngokwesekhtjheni yomthetho ophathelene nalokhu... | 17 |


In [22]:
#Summary of all test data
df_test.describe(include='all').T

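|   | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| index | 5682.0 | NaN | NaN | NaN | 2841.5 | 1640.396446 | 1.0 | 1421.25 | 2841.5 | 4261.75 | 5682.0 |
| text | 5682.0 | 5459.0 | Hoekom moet Onderhoud Betaal word? | 6.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |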


In [27]:
# Count how often each distinct (lang_id, text) row occurs in the train data; the most repeated rows are listed first.
df_train.value_counts()

lang_id  text
nbl      ngokwesekhtjheni yomthetho ophathelene nalokhu unelungelo lokudlulisela isililo sakho kusomkhandlu wezehlalakuhle ngokutlola incwadi uyithumele e-adresini elandelako kungakapheli amalanga amatjhumi alithoba ukusukela mhlazana uthola incwadi le    17
         ukubhalelwa kuzalisa iimfuneko zomthetho ophathelene nalokhu kungawufelelisa umrholwakho naweqisa iinyanga ezintathu ngokulandelana ungawuthathi umrholwakho nakhona uzakufelela umrholo owuthole ngokungakafaneli kufuze uwubuyise    14
         imali osalele ngayo emva nayo seyifakiwe emrholweni wakho wokuthoma nakungenzeka u
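
Calling `value_counts()` on the whole frame counts repeated rows, which is why the output above lists individual sentences. To see how many samples each language has, the count can be taken on the `lang_id` column alone. The sketch below uses a made-up frame. On the real train set, the summary statistics shown earlier (11 languages, 33000 rows, top frequency 3000) imply 3000 rows per language.

```python
import pandas as pd

# Made-up labels for illustration only
df_demo = pd.DataFrame(
    {
        "lang_id": ["eng", "afr", "afr", "xho", "xho", "xho"],
        "text": ["a", "b", "c", "d", "e", "f"],
    }
)

# Number of rows per language, largest first
class_counts = df_demo["lang_id"].value_counts()
print(class_counts)

# Share of each language, useful for spotting class imbalance
class_share = df_demo["lang_id"].value_counts(normalize=True)
print(class_share)
```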

In [26]:
# Count how often each distinct (index, text) row occurs in the test data.
# Every count is 1 because index is unique, so this does not reveal repeated texts.
df_test.value_counts()

index  text
1      Mmasepala, fa maemo a a kgethegileng a letlelela kgato eo.    1
3818   a ka tsenngwang mo tirisong.    1
3794   nyimele dza muthu dza mahoro, nga maan.