# CMS Q3 Project: Language detection

## The data

This project uses the Wikicorpus, a corpus scraped from the English, Spanish, and Catalan versions of Wikipedia. It contains over 750 million words.

I downloaded the full English and Spanish texts: about 600 million words of English and about 120 million words of Spanish. The end goal is to type in some words in either Spanish or English and have a neural net recognize which language it is, and then to investigate the workings of the neural net to find out which features of the text are important to language detection.

The project only uses the first 10,000 words in each set for training; processing more simply takes too long.

## Text processing

In [187]:
import numpy as np
import nltk
import os

file_paths = [[], []]    #file_paths[0] = English, file_paths[1] = Spanish

#collect the path of every corpus file, sorted by name
for index, lang in enumerate(['en', 'es']):
    #set up directories for finding files
    data_dir = "/Users/tulasiholdridge/Downloads/CMS Q3/{}".format('raw.'+lang)    #path to files

    for fname in sorted(os.listdir(data_dir)):
        file_paths[index].append(os.path.join(data_dir, fname))

def read_file(path):
    #read one corpus file in full and close it afterwards
    with open(path, encoding="latin-1") as f:
        return f.read()

raw_data = []
labels = []

#read the first English file and the first two Spanish files in full and add to raw dataset
raw_data.append(read_file(file_paths[0][0]))
labels.append('en')

raw_data.append(' '.join([read_file(file_paths[1][0]), read_file(file_paths[1][1])]))
labels.append('es')

In [188]:
print(raw_data[0][:150])    #raw english text
print('\n************\n')
print(raw_data[1][:150])    #raw spanish text

<doc id="214730" title="Henry Hallam" nonfiltered="1" processed="1" dbindex="0">
Henry Hallam (July 9, 1777 - January 21, 1859) was an  English histor

************

<doc id="20540" title="658" nonfiltered="1" processed="1" dbindex="10000">

 Acontecimientos .


 Nacimientos .


 Fallecimientos .
Fulgencio de Écija


In [189]:
from bs4 import BeautifulSoup    #will help with text cleaning
import re    #regex

def process_wikicorpus(raw):
    '''
    Takes a list where each item is a raw text. Returns a list, in the same order as the original,
    where each item is that text's lowercase words (no html, punctuation, numbers, etc.).
    '''
    
    processed = []
    total = len(raw)
    current = 1
    for lang in raw:
        print('processing {} of {}'.format(current, total))
        current += 1

        #process w/ BeautifulSoup
        print('...parsing')
        soup = BeautifulSoup(lang, 'html.parser')
        text = soup.get_text()

        #continue cleaning: collapse whitespace, tokenize, drop non-alphabetic tokens, lowercase
        print('...cleaning')
        text = re.sub(r"\s+", " ", text)
        tokens = nltk.word_tokenize(text)
        lower_words = [word.lower() for word in tokens if word.isalpha() and word!='ENDOFARTICLE']
        
        print('\n')
        processed.append(lower_words)
    
    print('Done!')
    return processed
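
To make the cleaning steps concrete, here is a minimal standalone sketch that runs on a short sample string. It is an approximation, not the function above: plain regular expressions stand in for BeautifulSoup and `nltk.word_tokenize`, and the name `clean_sample` and the sample text are made up for illustration.

```python
import re

def clean_sample(raw):
    # drop anything that looks like a tag, e.g. <doc ...> and </doc>
    text = re.sub(r"<[^>]+>", " ", raw)
    # collapse runs of whitespace into single spaces
    text = re.sub(r"\s+", " ", text)
    # keep runs of letters only (rough stand-in for tokenizing + isalpha filtering)
    words = re.findall(r"[^\W\d_]+", text)
    # drop the article separator and lowercase everything else
    return [w.lower() for w in words if w != "ENDOFARTICLE"]

sample = ('<doc id="214730" title="Henry Hallam">\n'
          'Henry Hallam (July 9, 1777 - January 21, 1859) was an  English historian.\n'
          'ENDOFARTICLE.\n</doc>')

print(clean_sample(sample))
# ['henry', 'hallam', 'july', 'january', 'was', 'an', 'english', 'historian']
```

Dates and punctuation disappear because only purely alphabetic tokens survive the filter. The regex version will split hyphenated or apostrophized words differently from nltk, so treat it as an illustration only.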

In [190]:
#takes a while
processed_data = process_wikicorpus(raw_data)

processing 1 of 2
...parsing
...cleaning


processing 2 of 2
...parsing
...cleaning


Done!


In [191]:
print(processed_data[0][:20])    #clean english text
print('\n************\n')
print(processed_data[1][:20])    #clean spanish text

['henry', 'hallam', 'july', 'january', 'was', 'an', 'english', 'historian', 'the', 'only', 'son', 'of', 'john', 'hallam', 'canon', 'of', 'windsor', 'and', 'dean', 'of']

************

['acontecimientos', 'nacimientos', 'fallecimientos', 'fulgencio', 'de', 'écija', 'santo', 'español', 'erquinoaldo', 'mayordomo', 'franco', 'de', 'palacio', 'de', 'neustria', 'acontecimientos', 'nacimientos', 'egilona', 'última', 'reina']


In [192]:
#cutoff = length of the shortest word list, rounded down to a multiple of 100
cutoff = len(min(processed_data, key=len))
adjusted_cutoff = (cutoff//100)*100

processed_data[0] = processed_data[0][:adjusted_cutoff]
processed_data[1] = processed_data[1][:adjusted_cutoff]
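
As a small worked example of the cutoff arithmetic, the sketch below uses toy word lists of made-up lengths (1234 and 987). The helper name `balanced_cutoff` is hypothetical and not part of the notebook.

```python
def balanced_cutoff(word_lists, packet_size=100):
    # length of the shortest list, rounded down to a multiple of packet_size
    shortest = min(len(words) for words in word_lists)
    return (shortest // packet_size) * packet_size

toy = [["w"] * 1234, ["w"] * 987]    # made-up lengths
toy_cutoff = balanced_cutoff(toy)
trimmed = [words[:toy_cutoff] for words in toy]

print(toy_cutoff, [len(t) for t in trimmed])
# 900 [900, 900]
```

Trimming both languages to the same length gives the two classes the same number of words, and the multiple of 100 means no partial packet is left over.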

In [193]:
#make packets of 100 words
num_packets = adjusted_cutoff//100

#join words in packets
#final_data[i][j] is packet j of language i (0 = English, 1 = Spanish)
final_data = [[' '.join(words[j*100:j*100+100]) for j in range(num_packets)]
              for words in processed_data]
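
The same packet logic on a toy list, with a packet size of 3 instead of 100 so the result is easy to read. `make_packets` is a hypothetical helper name used only for this sketch.

```python
def make_packets(words, packet_size=100):
    # number of complete packets; leftover words at the end are dropped
    num = len(words) // packet_size
    return [" ".join(words[k * packet_size:(k + 1) * packet_size])
            for k in range(num)]

toy_words = ["uno", "dos", "tres", "cuatro", "cinco", "seis", "siete"]
toy_packets = make_packets(toy_words, packet_size=3)

print(toy_packets)
# ['uno dos tres', 'cuatro cinco seis']
```

Each packet is a single string with words separated by spaces, which is why the space character shows up as its own symbol in the encoding step.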





## Text encoding

In [208]:
letters = ['a', 'á', 'b', 'c', 'd', 'e', 'é', 'f', 'g', 'h', 'i', 'í', 'j', 'k', 'l', 'm', 'n', 
           'ñ', 'o', 'ó', 'p', 'q', 'r', 's', 't', 'u', 'ú', 'v', 'w', 'x', 'y', 'z', ' ']

def packet_to_matrix(packet, letter_list):
    '''
    One-hot encodes a packet: one row per character, one column per entry in letter_list,
    plus a final column for characters that are not in letter_list.
    '''
    packet_length = len(packet)
    letter_length = len(letter_list)
    one_hot_packet = np.zeros((packet_length, letter_length+1))
    
    for i in range(packet_length):
        try:
            index = letter_list.index(packet[i])
        except ValueError:
            index = letter_length    #not in letter_list, so use the extra last column
            
        one_hot_packet[i][index] = 1
        
    return one_hot_packet
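
To see what the encoding produces, here is a dependency-free sketch on a four-symbol alphabet. It assumes the extra last column is meant to catch characters outside the alphabet. Plain Python lists stand in for the NumPy array, and `one_hot_rows` and `toy_alphabet` are hypothetical names.

```python
def one_hot_rows(text, alphabet):
    width = len(alphabet) + 1          # one extra column for "other" characters
    rows = []
    for ch in text:
        row = [0] * width
        # known characters use their alphabet position, everything else the last column
        col = alphabet.index(ch) if ch in alphabet else width - 1
        row[col] = 1
        rows.append(row)
    return rows

toy_alphabet = ['a', 'b', 'c', ' ']
rows = one_hot_rows("ab z", toy_alphabet)

print(rows)
# [[1, 0, 0, 0, 0], [0, 1, 0, 0, 0], [0, 0, 0, 1, 0], [0, 0, 0, 0, 1]]

# summing down each column gives a count per symbol
column_sums = [sum(col) for col in zip(*rows)]
print(column_sums)
# [1, 1, 0, 1, 1]
```

Every row contains exactly one 1, so the column sums always add up to the length of the text.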

In [209]:
#letter counts for each packet (summed one-hot rows), which I probably could have done with nltk but oh well
letter_freqs = np.zeros([2, num_packets, len(letters)+1])

for i, lang in enumerate(final_data):
    for j in range(num_packets):
        temp = packet_to_matrix(final_data[i][j], letters)
        letter_freqs[i][j] = np.sum(temp, axis = 0)
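
Since only the column sums are kept, the counts could also be computed directly without building the one-hot matrix first. This is a possible alternative, not what the notebook does: a sketch using `collections.Counter`, again assuming out-of-alphabet characters go in a final "other" slot. `letter_counts` is a hypothetical name.

```python
from collections import Counter

def letter_counts(packet, alphabet):
    counts = Counter(packet)
    # one count per alphabet entry, in alphabet order
    known = [counts[ch] for ch in alphabet]
    # everything that was not matched goes in the last slot
    other = len(packet) - sum(known)
    return known + [other]

toy_alphabet = ['a', 'b', 'c', ' ']
toy_counts = letter_counts("ab z a", toy_alphabet)

print(toy_counts)
# [2, 1, 0, 2, 1]
```

This does one pass over the packet instead of one list lookup per character, which may matter if more of the corpus is used.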

## Separate training and test data

## Training neural net

## Analysis of results

## Try it yourself!

## Sources (so far)
https://www.quora.com/How-do-I-read-mutiple-txt-files-from-folder-in-python

https://www.w3schools.com/python/python_regex.asp

https://machinelearningmastery.com/how-to-one-hot-encode-sequence-data-in-python/

https://stackoverflow.com/questions/312443/how-do-you-split-a-list-into-evenly-sized-chunks