# The Language Classification Problem

Due to time constraints, a full report was not written, but the code was attempted.

## Libraries 
Pandas was used to manipulate the data.
TextBlob was used to process the textual data, and it has a built-in NaiveBayesClassifier.
NumPy was used for its random function.


In [1]:
import pandas as pd  
import numpy as np  
from textblob.classifiers import NaiveBayesClassifier  
import os
os.chdir('/home/andrew/Documents/MachineL/Post/Language-Classification')   # Set file directory
datafull = pd.read_csv('lang_data.csv')  # Read and assign data file
datafull.head()   

Unnamed: 0,text,language
0,Ship shape and Bristol fashion,English
1,Know the ropes,English
2,Graveyard shift,English
3,Milk of human kindness,English
4,Touch with a barge-pole - Wouldn't,English


## Preparation 
Remove the missing values from the text column.

In [2]:
df = datafull[pd.notnull(datafull['text'])]

## Two stage classification

There are few Nederlands examples relative to English and Afrikaans. To address this class imbalance, the model uses a two-stage classification.

1st stage: Differentiate between English and (Afrikaans + Nederlands).

2nd stage: Differentiate between Afrikaans and Nederlands.

For the first classifier, all the Afrikaans and Nederlands rows are combined under a single label. The second classifier only looks at the Afrikaans and Nederlands rows, with their original, separate labels.
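
The relabelling for the two stages can be sketched on a few toy rows. This is a minimal illustration in plain Python, not the notebook's pandas code; the helper names `first_stage_rows` and `second_stage_rows` and the `COMBINED` label are assumptions made for the example.

```python
# Toy (text, language) rows; the real data comes from lang_data.csv.
rows = [
    ("Know the ropes", "English"),
    ("Jy moet nie die vel verkoop voordat die beer geskiet is nie", "Afrikaans"),
    ("Je moet de huid niet verkopen voordat de beer geschoten is.", "Nederlands"),
]

COMBINED = "Afrikaans+Nederlands"  # hypothetical combined label


def first_stage_rows(rows):
    """Keep every row, but merge Afrikaans and Nederlands into one label."""
    return [
        (text, lang if lang == "English" else COMBINED)
        for text, lang in rows
    ]


def second_stage_rows(rows):
    """Keep only Afrikaans and Nederlands rows, with their original labels."""
    return [
        (text, lang)
        for text, lang in rows
        if lang in ("Afrikaans", "Nederlands")
    ]


stage1 = first_stage_rows(rows)
stage2 = second_stage_rows(rows)
print(stage1)
print(stage2)
```

The first list is what the stage-one classifier would be trained on, and the second list is what the stage-two classifier would be trained on.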


### 1st stage data preparation

In [4]:
# Makes a one-column dataframe holding the combined label for the Afrikaans and Nederlands rows
filename = 'Afikaans+Nederlands'
numbers = np.random.randn(2839) # length must cover the largest index in df, because join matches rows on the index
tf = pd.DataFrame({'language': filename , 'numbers': numbers})
AfriNedercol = tf.drop(['numbers'], axis=1)


# Pull all the rows with Nederlands and Afrikaans and then drop the language column
dfNederAfri = df[(df['language'] == "Nederlands") | (df['language'] == "Afrikaans")]
dfNederAfri_nocol = dfNederAfri.drop(['language'], axis=1)

# Add the combined 'Afrikaans + Nederlands' label column (rows are matched on the index)
temp_AfriNeder = dfNederAfri_nocol.join(AfriNedercol)

# Pull all the English rows
Engcol = df[(df['language'] == "English")]

# Stack the English rows on top of the new 'Afrikaans + Nederlands' rows
# (pd.concat is used because DataFrame.append was removed in pandas 2.0)
df_Eng_or_AfriandNeder = pd.concat([Engcol, temp_AfriNeder])
df_Eng_or_AfriandNeder.tail()

Unnamed: 0,text,language
2821,Vergelyk eers wat jy het met wat jy nie het ni...,Afikaans+Nederlands
2825,Wat is die belangrikste – geluk of sukses?,Afikaans+Nederlands
2826,"Waar rook is, is vuur.",Afikaans+Nederlands
2832,Die boer se voetspore is die beste misstof op ...,Afikaans+Nederlands
2834,Daar’s ‘n geurtjie aan.,Afikaans+Nederlands


In [5]:
# Convert the dataframe to a list of [text, language] rows: one of the input formats textblob accepts
list_Eng_or_AfriandNeder=[]

for row in df_Eng_or_AfriandNeder.iterrows():
    index, data = row
    list_Eng_or_AfriandNeder.append(data.tolist())

type(list_Eng_or_AfriandNeder)

list

### 1st stage classification

The first model is trained on the data set in which the Afrikaans and Nederlands rows share a combined label (this data also includes the English rows).

In [6]:
# Training the textblob classifier
Cl_Eng_or_AfriandNeder = NaiveBayesClassifier(list_Eng_or_AfriandNeder) # Optional: pass feature_extractor=new_feature to add a feature

# Cl_Eng_or_AfriandNeder.classify("Put words to be classified here") # <- How to call the classification

### 2nd stage data preparation

In [7]:
# Take the Afrikaans and Nederlands rows of the dataframe and put them into a list

list_NederorAfri=[]
dfNederAfri = df[(df['language'] == "Afrikaans") | (df['language'] == "Nederlands")]

for row in dfNederAfri.iterrows():
    index, data = row
    list_NederorAfri.append(data.tolist())
type(list_NederorAfri)

list

### 2nd stage classification

The second model is then trained only on the Afrikaans and Nederlands rows (NB: this model was not trained on any English rows).

In [8]:
# Trains the classifier to distinguish between Afrikaans and Nederlands
Cl_Neder_or_Afri = NaiveBayesClassifier(list_NederorAfri) 

# Cl_Neder_or_Afri.classify("Put words to be classified here") # <- How to call the classification

## Combining stages into SelectLanguage function

The first if statement determines whether the language is English or (Afrikaans + Nederlands).

If the text is not English, the second condition (the elif) then determines whether the language is Afrikaans or Nederlands.

In [9]:
def SelectLanguage(wordsin): 
    
    if (Cl_Eng_or_AfriandNeder.classify(wordsin) == 'English'):
        return 'English'
    
    elif (Cl_Neder_or_Afri.classify(wordsin) == 'Afrikaans'):
        return 'Afrikaans'
    else:
        return 'Nederlands'        

In [10]:
# Testing ground: Can call the function SelectLanguage 
print(SelectLanguage("You should not sell the skin before the bear is shot."))
print(SelectLanguage("Jy moet nie die vel verkoop voordat die beer geskiet is nie"))
print(SelectLanguage("Je moet de huid niet verkopen voordat de beer geschoten is."))

English
Afrikaans
Nederlands


## Model’s architecture

The textblob classifier uses individual words as features. Some additional features were tested (vowel ratio, repeating letters, length), but these either didn't work or overfit. Accuracy goes up as the number of words in the input string increases.
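
As an illustration of what an extra feature can look like, here is a minimal sketch of a custom feature extractor: a function that takes a document string and returns a feature dictionary. The function name `word_and_vowel_features` and the one-decimal bucketing of the vowel ratio are assumptions for this example. It is not the extractor that was tested in this notebook, and it does not reproduce textblob's default extractor.

```python
VOWELS = set("aeiouAEIOU")


def word_and_vowel_features(document):
    """Return one contains(word) flag per token plus a coarse vowel-ratio bucket."""
    tokens = document.split()
    features = {"contains({})".format(token): True for token in tokens}

    letters = [ch for ch in document if ch.isalpha()]
    if letters:
        ratio = sum(ch in VOWELS for ch in letters) / len(letters)
        # Round to one decimal so the ratio behaves like a categorical feature
        features["vowel_ratio"] = round(ratio, 1)
    return features


feats = word_and_vowel_features("Know the ropes")
print(feats)

# Assumed usage with textblob (not run here):
# classifier = NaiveBayesClassifier(train_rows, feature_extractor=word_and_vowel_features)
```

Bucketing the ratio keeps the number of distinct feature values small. A raw float would give almost every sentence its own value, which is one way such a feature can end up overfitting.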

In [13]:
Cl_Eng_or_AfriandNeder.show_informative_features(5)      # English or (Afrikaans + Nederlands)

Most Informative Features
             contains(n) = True           Afikaa : Englis =    373.2 : 1.0
           contains(Die) = True           Afikaa : Englis =    169.6 : 1.0
           contains(die) = True           Afikaa : Englis =    147.7 : 1.0
             contains(a) = True           Englis : Afikaa =     54.7 : 1.0
             contains(A) = True           Englis : Afikaa =     30.6 : 1.0


In [14]:
Cl_Neder_or_Afri.show_informative_features(5)            # Afrikaans or Nederlands

Most Informative Features
           contains(Wie) = True           Nederl : Afrika =     40.8 : 1.0
           contains(Het) = True           Nederl : Afrika =     22.0 : 1.0
           contains(één) = True           Nederl : Afrika =     13.2 : 1.0
         contains(blind) = True           Nederl : Afrika =      9.4 : 1.0
           contains(oog) = True           Nederl : Afrika =      9.4 : 1.0


## Accuracy of the model

In [15]:
# Setting language as the index column so that the rows of each language can be selected and put into a for loop
df.set_index("language", inplace=True)
df.head()

language,text
English,Ship shape and Bristol fashion
English,Know the ropes
English,Graveyard shift
English,Milk of human kindness
English,Touch with a barge-pole - Wouldn't


In [16]:
# Pulling all the English rows and counting the number of times the SelectLanguage function gets it correct
English = df.loc['English']
list_English = English['text'].tolist()

e=0
for x in range(len(list_English)):
    
    if(SelectLanguage(list_English[x]) == 'English'):
        e = e+1
e        

2055

In [17]:
# Pulling all the Afrikaans rows and counting the number of times the SelectLanguage function gets it correct
Afrikaans = df.loc['Afrikaans']
list_Afrikaans = Afrikaans['text'].tolist()

a=0
for x in range(len(list_Afrikaans)):
    
    if(SelectLanguage(list_Afrikaans[x]) == 'Afrikaans'):
        a = a+1
a

616

In [18]:
# Pulling all the Nederlands rows and counting the number of times the SelectLanguage function gets it correct
Nederlands = df.loc['Nederlands']
list_Nederlands = Nederlands['text'].tolist()

n=0
for x in range(len(list_Nederlands)):
    
    if(SelectLanguage(list_Nederlands[x]) == 'Nederlands'):
        n = n+1
n

40

In [19]:
# Results
# The relative proportion of the number of times SelectLanguage got it correct over the length of the list
print('The accuracy of English is =', e/len(list_English))
print('The accuracy of Afrikaans is =', a/len(list_Afrikaans))
print('The accuracy of Nederlands is =', n/len(list_Nederlands))

# The SelectLanguage function has a low accuracy with Nederlands

The accuracy of English is = 1.0
The accuracy of Afrikaans is = 0.9640062597809077
The accuracy of Nederlands is = 0.5970149253731343
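
The counts above are taken over the same rows the classifiers were trained on, so they measure how well the model fits its training data. A minimal sketch of a held-out check follows: shuffle the rows, split them, and score each language separately. The names `split_rows`, `per_class_accuracy`, and the stub `predict` are hypothetical; in this notebook `predict` would be `SelectLanguage`, with both classifiers retrained on the training portion only.

```python
import random
from collections import Counter


def split_rows(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of the rows and split it into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def per_class_accuracy(predict, rows):
    """Return {language: fraction of that language's rows predicted correctly}."""
    total, correct = Counter(), Counter()
    for text, language in rows:
        total[language] += 1
        if predict(text) == language:
            correct[language] += 1
    return {language: correct[language] / total[language] for language in total}


# Stub predictor standing in for SelectLanguage (assumption for the demo)
def predict(text):
    return "Afrikaans" if "nie" in text.split() else "English"


rows = [
    ("Know the ropes", "English"),
    ("Graveyard shift", "English"),
    ("Jy moet nie die vel verkoop voordat die beer geskiet is nie", "Afrikaans"),
    ("Je moet de huid niet verkopen voordat de beer geschoten is.", "Nederlands"),
]
train, test = split_rows(rows, test_fraction=0.25)
scores = per_class_accuracy(predict, rows)
print(len(train), len(test))
print(scores)
```

In practice the classifiers would be trained on `train` and `per_class_accuracy` would be called on `test`. The demo scores all four toy rows only to show the shape of the result.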


## Improvements
The accuracy for Nederlands is low (about 0.60). It could possibly be improved
by incorporating letter groupings, for example groups of three letters (character trigrams) that are more likely to belong to a specific language.
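
The letter-grouping idea could be sketched as a character-trigram feature extractor. This is an untested suggestion rather than part of the model above; the function name `char_trigram_features` and the choice to lowercase and pad with spaces are assumptions.

```python
def char_trigram_features(document):
    """Return one contains_trigram(xyz) flag per distinct 3-character window."""
    # Lowercase, normalise whitespace, and pad with spaces so that
    # word starts and word ends form their own trigrams
    text = " " + " ".join(document.lower().split()) + " "
    return {
        "contains_trigram({})".format(text[i:i + 3]): True
        for i in range(len(text) - 2)
    }


dutch = char_trigram_features("Je moet de huid niet verkopen")
afrikaans = char_trigram_features("Jy moet nie die vel verkoop")

# Trigrams that appear in only one of the two sentences
print(sorted(set(dutch) - set(afrikaans)))
print(sorted(set(afrikaans) - set(dutch)))
```

A function like this could be passed to NaiveBayesClassifier through its feature_extractor argument. Whether it actually raises the Nederlands accuracy would need to be measured.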