<a href="https://colab.research.google.com/github/taylor-rao/Language-Detection-Neural-Net/blob/master/Language.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Introduction

The purpose of this project is to create a neural network that can be fed text and detect which language it's written in. Even though I don't know any Spanish or German, I can distinguish between Spanish and German text just because I've learned to recognize different characters and character patterns. At this point in its evolution, the program can tell the difference between English and Spanish.

##Imports

In [0]:
!pip install python-docx

Collecting python-docx
  Downloading https://files.pythonhosted.org/packages/e4/83/c66a1934ed5ed8ab1dbb9931f1779079f8bca0f6bbc5793c06c4b5e7d671/python-docx-0.8.10.tar.gz (5.5MB)
Building wheels for collected packages: python-docx
  Building wheel for python-docx (setup.py) ... done
  Created wheel for python-docx: filename=python_docx-0.8.10-cp36-none-any.whl size=184490 sha256=d6113c8bf6d5d4046df3eda2e8a4d420867c7f20bfa602939de7ee0110ab31bc
  Stored in directory: /root/.cache/pip/wheels/18/0b/a0/1dd62ff812c857c9e487f27d80d53d2b40531bec1acecfa47b
Successfully built python-docx
Installing collected packages: python-docx
Successfully installed python-docx-0.8.10


In [0]:

import docx
from docx import Document
import numpy as np
import pandas as pd
from random import shuffle
from keras.utils import to_categorical
from keras.models import Sequential
from keras.layers import Dense, Activation

##Function definitions

In [0]:
def readtxt(filename):
    doc = docx.Document(filename)
    fullText = []
    for para in doc.paragraphs:
        fullText.append(para.text)
    return '\n'.join(fullText)

def purity(language):
  language = language.replace('\n', ' ')
  language = language.replace('\'', '')
  res = ''.join([i for i in language if not i.isdigit()])
  return res


def createSamples(string):
  subString = len(string)//100
  sampleString = []
  for i in range(0, subString):
    ind= i*100
    sampleString.append(string[ind:ind+100])
  return(sampleString)

def charCount(string):
  # Count each distinct character, then sort from rarest to most common.
  thisList = [(ch, string.count(ch)) for ch in set(string)]
  sortedList = sorted(thisList, key=lambda tup: tup[1])
  return sortedList

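def cleanse(tup):
  # tup is the ascending (char, count) list from charCount; walk it from the most common end.
  myList = []
  n = 10000
  i = len(tup) - 1
  while n > 1000 and i >= 0:
    myList.append(tup[i][0])
    n = tup[i][1]
    i -= 1
  # Note: the count is checked after appending, so the first character at or below 1000 is also kept.
  return myList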

def encoder(string, col):
  cask = np.zeros((len(string), len(col)), dtype = int)
  for i in range(len(string)):
    for j in range(len(col)):
      if string[i] == col[j]:
        cask[i][j] = 1
      else:
        cask[i][j] =0
  return cask.flatten()

def rsamples(l, length):
  indices = np.random.choice(len(l), length, replace=False)
  myList = []
  for i in indices:
    myList.append(l[i])
  return myList
    
    




#Data Wrangling

These blocks of code break up each file into 100-character chunks. Before these chunks can be fed into a neural network, they need to be converted into numerical form, which is done using one-hot encoding.
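As a small illustration of the chunking step, here is a sketch that splits a string into fixed-size, non-overlapping pieces. The name `chunk_text` and the chunk size of 5 are made up for the example; the notebook's own `createSamples` uses a fixed size of 100.

```python
def chunk_text(text, size):
    """Split text into consecutive, non-overlapping chunks of `size` characters.

    Any trailing remainder shorter than `size` is dropped.
    """
    n_chunks = len(text) // size
    return [text[i * size:(i + 1) * size] for i in range(n_chunks)]


# Toy input: 19 characters, so three full chunks of 5 and a dropped remainder.
chunks = chunk_text("the quick brown fox", 5)
print(chunks)  # ['the q', 'uick ', 'brown']
```

Dropping the remainder keeps every sample the same length, which the fixed-width network input relies on.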

In [0]:
spa = str(readtxt('spanish.docx'))
eng= str(readtxt('english.docx'))

In [0]:
spa = purity(spa)
eng = purity(eng)

In [0]:
engSamples = createSamples(eng)
spaSamples = createSamples(spa)

In [0]:
total = eng + spa

Draw a random sample of these 100-character substrings from each language. `rsamples` picks them without replacement, so the order in which they appear is randomized as well.
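A minimal sketch of sampling without replacement, assuming NumPy's `Generator` API. The function name `sample_without_replacement` and the `seed` parameter are illustrative additions and are not part of the notebook.

```python
import numpy as np


def sample_without_replacement(items, k, seed=None):
    """Return k distinct items from `items`, in random order."""
    rng = np.random.default_rng(seed)
    # replace=False guarantees that no index is drawn twice.
    indices = rng.choice(len(items), size=k, replace=False)
    return [items[i] for i in indices]


toy_chunks = ["aaa", "bbb", "ccc", "ddd", "eee"]
picked = sample_without_replacement(toy_chunks, 3, seed=0)
print(picked)
```

Note that asking for more items than the list holds raises a `ValueError`, so each language needs at least as many chunks as the requested sample size.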

In [0]:
reng = rsamples(engSamples, 50000)
rspa = rsamples(spaSamples, 10000)

##Create list of characters that occur more than 1000 times.

In order to shorten the input vectors, uncommon characters are left out of the columns for the one-hot encoded chunks. This is done to each 100-character chunk using the 'encoder' function defined at the top, which takes a string as its first argument and the columns to encode onto as its second argument.
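To make the encoding concrete, here is a small sketch of one-hot encoding a string against a fixed list of columns. The name `one_hot` and the three-letter column list are assumptions for the example. It is meant to give the same result as `encoder` when the columns are unique, using a lookup table instead of a nested loop.

```python
import numpy as np


def one_hot(text, columns):
    """Encode each character of `text` as a row with a single 1, then flatten."""
    index = {ch: j for j, ch in enumerate(columns)}
    grid = np.zeros((len(text), len(columns)), dtype=int)
    for i, ch in enumerate(text):
        j = index.get(ch)
        if j is not None:  # characters outside `columns` stay all-zero
            grid[i, j] = 1
    return grid.flatten()


vec = one_hot("abz", ["a", "b", "c"])
print(vec)  # [1 0 0 0 1 0 0 0 0]
```

The flattened vector has `len(text) * len(columns)` entries, which is why leaving out rare characters shortens the network input.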

In [0]:
charCount(total)

[('æ', 1),
 ('>', 1),
 ('<', 2),
 ('^', 2),
 ('œ', 3),
 ('î', 3),
 ('~', 4),
 ('À', 4),
 ('ô', 4),
 ('Í', 5),
 ('%', 6),
 ('ä', 6),
 ('ç', 7),
 ('@', 8),
 ('&', 8),
 ('â', 12),
 ('ê', 23),
 ('à', 28),
 ('ý', 32),
 ('©', 35),
 ('ö', 35),
 ('Ó', 36),
 ('ï', 39),
 ('É', 59),
 ('+', 89),
 ('ü', 98),
 ('Á', 105),
 ('$', 111),
 ('¿', 112),
 ('/', 130),
 ('¡', 180),
 ('ë', 240),
 ('è', 267),
 ('‘', 268),
 ('Z', 301),
 ('Q', 349),
 ('|', 408),
 ('[', 418),
 (']', 421),
 ('»', 446),
 ('«', 449),
 ('Ú', 516),
 ('*', 716),
 ('#', 741),
 ('X', 1271),
 ('=', 1762),
 ('ñ', 1935),
 ('ú', 1978),
 ('U', 2097),
 ('—', 2120),
 ('J', 2145),
 (')', 2368),
 ('(', 2369),
 ('K', 2815),
 ('V', 3408),
 (':', 3420),
 ('Y', 3852),
 ('é', 4301),
 ('L', 4424),
 ('G', 4686),
 (';', 4953),
 ('_', 5486),
 ('O', 5923),
 ('á', 6151),
 ('F', 6894),
 ('D', 7032),
 ('’', 7260),
 ('?', 7282),
 ('!', 8327),
 ('”', 8970),
 ('W', 8973),
 ('“', 9003),
 ('R', 9102),
 ('í', 9303),
 ('E', 9412),
 ('z', 9458),
 ('B', 10009),
 ('C',

In [0]:
usedChars = cleanse(charCount(total))
usedChars
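For comparison, the same idea can be sketched with `collections.Counter` from the standard library. The name `frequent_chars` and the toy threshold are hypothetical. This version applies a strict "more than" test, so on real data it may return one character fewer than `cleanse`, which would change the input width.

```python
from collections import Counter


def frequent_chars(text, min_count):
    """Return characters occurring more than `min_count` times, most common first."""
    counts = Counter(text)
    return [ch for ch, n in counts.most_common() if n > min_count]


print(frequent_chars("aaaabbbc", 2))  # ['a', 'b']
```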

In [0]:
elist = []
for i in reng:
  elist.append([encoder(i, usedChars), 0])

In [0]:
slist = []
for i in rspa:
  slist.append([encoder(i, usedChars), 1])
  

In [0]:
tlist = elist + slist

In [0]:
shuffle(tlist)

##Create predictors and target.

In [0]:
# 7400 = 100 characters per sample * len(usedChars), so this assumes usedChars has 74 entries
data = np.zeros((len(tlist),7400), dtype = int)
for i in range(len(tlist)):
  data[i] = tlist[i][0]
  

In [0]:
labels = np.zeros(len(tlist), dtype = int)
for n, sample in enumerate(tlist):
  labels[n] = sample[1]

labels = to_categorical(labels)
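As an illustration of what the one-hot label step produces, here is a NumPy-only sketch on a toy label array. It is not the Keras implementation; `to_categorical` returns floating-point values by default, while this sketch uses integers to keep the output easy to read.

```python
import numpy as np

# Toy labels: 0 = English, 1 = Spanish, matching the notebook's convention.
toy_labels = np.array([0, 1, 1, 0])

# Row k of the identity matrix is the one-hot vector for class k.
one_hot_labels = np.eye(2, dtype=int)[toy_labels]
print(one_hot_labels)
```

Each row has a 1 in the column of its class, which is the shape the two-unit softmax output layer is trained against.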

#Modeling

In [0]:
model = Sequential()
model.add(Dense(32, input_shape=(7400,)))
model.add(Dense(32, activation = 'relu'))
model.add(Dense(2, activation = 'softmax'))




#Results

In [0]:
model.compile(optimizer = 'adam', loss = 'categorical_crossentropy', metrics = ['accuracy'])
model.fit(data, labels)




Epoch 1/1


<keras.callbacks.History at 0x7fe2996eec88>
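The `fit` call above does not set any samples aside. A minimal sketch of one way to hold out a test set is shown below on toy arrays. The helper name `split_arrays`, the 20% test fraction, and the seed are assumptions for the example, not values from the notebook.

```python
import numpy as np


def split_arrays(data, labels, test_fraction=0.2, seed=0):
    """Shuffle row indices once, then split data and labels the same way."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(data))
    n_test = int(len(data) * test_fraction)
    test_idx, train_idx = order[:n_test], order[n_test:]
    return data[train_idx], labels[train_idx], data[test_idx], labels[test_idx]


# Toy stand-ins for the notebook's `data` and `labels` arrays.
demo_data = np.arange(20).reshape(10, 2)
demo_labels = np.arange(10)
x_train, y_train, x_test, y_test = split_arrays(demo_data, demo_labels)
print(x_train.shape, x_test.shape)  # (8, 2) (2, 2)
```

With the real arrays, the training part could be passed to `model.fit` and the held-out part to `model.evaluate`, both of which are standard Keras model methods.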

#Conclusion and next steps

This model reached over 98% accuracy on testing data, and there's no reason to think that it couldn't get even more accurate if fed more data. The next steps here would be to change the code to make it easy to add other languages. Some basic web scraping could be done to build datasets similar to the Spanish and English ones I used here.
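One possible shape for the multi-language step is sketched below: samples are kept in a dictionary keyed by language name, and integer labels are assigned automatically. The function `label_samples`, the dictionary layout, and the toy strings are all hypothetical and are not part of the current code.

```python
def label_samples(samples_by_language):
    """Assign an integer label to each language and pair every chunk with it."""
    languages = sorted(samples_by_language)
    label_of = {lang: k for k, lang in enumerate(languages)}
    labelled = [
        (chunk, label_of[lang])
        for lang in languages
        for chunk in samples_by_language[lang]
    ]
    return labelled, label_of


toy = {"spanish": ["hola", "adios"], "english": ["hello"], "german": ["hallo"]}
labelled, label_of = label_samples(toy)
print(label_of)  # {'english': 0, 'german': 1, 'spanish': 2}
```

Under this layout, the final `Dense` layer would need `len(label_of)` units instead of 2, and adding a language would only mean adding one more dictionary entry.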