# Free Text Analysis on Data Corpus Written in Multiple Languages

This data was obtained from an online survey deployed in a MOOC (Massive Open Online Course) called Conversational English. The MOOC was deployed on the Open edX platform and is targeted at students who want to learn how to speak English in conversational settings.

The survey asked students about their motivation to engage in an English-speaking MOOC.


Our survey drew 34,100 respondents from 166 countries; the top 10 countries represented were India (13.63%), Brazil (9.48%), Colombia (8.04%), Mexico (6.76%), Egypt (5.76%), China (3.5%), Spain (3.15%), Vietnam (2.21%), Pakistan (2.18%), and Russia (2.17%). Among all survey respondents, 1.64% stated that English was their native language, and 1.20% indicated that English was the language they spoke best; these participants were removed from the sample. 35% of the students indicated that they had spent some time in a region where English is often spoken. The table below shows student demographic information for survey participants including age, gender, and education level.


At the end of the survey, we asked the students the following question:

"How can we change MOOCs to help students who are non-native English speakers?"

I will be analyzing the responses to that question in this tutorial. About 8,041 students entered at least one character in the text box for the question above. The dataset is messy in several ways:

1. Some students entered punctuation only, so those responses need to be cleaned out.
2. Some of the responses are in other languages (e.g., Spanish), so they need to be translated to English.
3. Many of the responses contain random ASCII characters, because students who live abroad often have non-English computer keyboards, so that needs to be dealt with.
4. A lot of the English responses are grammatically incorrect, which may limit the strength of the inferences we can draw from the data.
5. Most importantly, we need to analyze the text corpus using different ML Python libraries to gain some insight into the kinds of design interventions that English Language Learners feel are beneficial.


To begin, we import the dataset:


In [1]:
import pandas as pd
import numpy as np
import re

In [2]:
# load the survey responses; the CSV is expected to provide the 'text' and 'english_speaker' columns
df = pd.read_csv('dataset.csv')

In [3]:
### AUTOLAB_IGNORE_START
#print(df)
### AUTOLAB_IGNORE_STOP

## Data Cleaning

Now that the dataset is imported, we can get an overview of what the data looks like. Some students entered only numbers, only punctuation, etc. as their responses. As a first step in the data cleaning process, we will change all responses to lowercase, delete all ' characters (e.g., don't becomes dont), and blank out all responses that don't contain at least one alphabetic character so they can be dropped later. Finally, we remove all leading and trailing spaces.

In [4]:
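# iterrows() may yield copies of each row, so write results back with df.at
for index, row in df.iterrows():
    text = str(row['text'])

    if re.search('[a-zA-Z]', text):
        # lowercase the response, drop apostrophes, trim surrounding spaces
        df.at[index, 'text'] = text.lower().replace('\'', '').strip()
    else:
        # no letters at all (numbers or punctuation only): blank it out
        df.at[index, 'text'] = ""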

### AUTOLAB_IGNORE_START
#print(df)
### AUTOLAB_IGNORE_STOP
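
To make the effect of this step concrete, here is a minimal sketch that applies the same rules to a few made-up responses. The helper name `clean_response` and the sample strings are illustrative assumptions, not part of the survey data.

```python
import re

def clean_response(text):
    """Lowercase, drop apostrophes, and trim; blank out responses with no a-z letters."""
    text = str(text)
    if re.search('[a-zA-Z]', text):
        return text.lower().replace("'", "").strip()
    return ""

# hypothetical responses, for illustration only
samples = ["  Don't speak so FAST ", "12345", "?!?", "Más subtítulos", "Привет"]
cleaned = [clean_response(s) for s in samples]
print(cleaned)
```

One thing this sketch makes visible: because the check looks for the letters a-z only, a response written entirely in a non-Latin script is blanked as well, which may or may not be what you want for a multilingual corpus.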

Another useful cleaning step is to collapse runs of whitespace into a single space, and to separate punctuation from letters and numbers by adding a space before and after it.

In [5]:
#iterate through the dataframe, get each text response, remove additional spaces, then add space before and after punctuation
for index, row in df.iterrows():
    text = str(row['text'])

    # replace each run of whitespace with a single space
    text = re.sub(r"\s+", " ", text)

    # pad each run of punctuation (hyphens excepted) with spaces, so it becomes a separate token;
    # a single substitution avoids padding the same character more than once
    text = re.sub(r"([^\w\s-]+)", r" \1 ", text)

    # write the formatted text back (rows from iterrows() may be copies)
    df.at[index, 'text'] = text
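
The whitespace and punctuation rules can be tried on a single made-up string. This is only a sketch: the function name `separate_punctuation` and the sample sentence are assumptions, and it expresses each rule as one regular-expression substitution.

```python
import re

def separate_punctuation(text):
    # collapse every run of whitespace into one space
    text = re.sub(r"\s+", " ", text)
    # pad each run of punctuation (hyphens excepted) with spaces
    text = re.sub(r"([^\w\s-]+)", r" \1 ", text)
    return text

# hypothetical response, for illustration only
sample = "more sub-titles,please!!  slower   videos"
tokens = separate_punctuation(sample).split()
print(tokens)
```

Splitting on whitespace afterwards yields punctuation as its own tokens, while hyphenated words such as "sub-titles" stay intact.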


Every data set is unique - I always find it useful to export my data to a csv, sort the data alphabetically, and identify phrases that are obviously not useful. This next step exports the data to a file called cleaned_dataset1.csv.

In [6]:
### AUTOLAB_IGNORE_START
df.to_csv("cleaned_dataset1.csv", sep=',', encoding='utf-8')
### AUTOLAB_IGNORE_STOP

Given that we asked the students to give us actual design recommendations, we expect fairly lengthy responses. Also, based on our analysis of the csv file, we observed that the vast majority of the responses with 20 characters or fewer were not informative enough for our research purposes. Therefore, we will eliminate all responses shorter than 21 characters, and create a smaller dataframe without the blank responses.

In [None]:
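# blank out responses shorter than 21 characters
for index, row in df.iterrows():
    text = str(row['text'])

    if len(text) < 21:
        df.at[index, 'text'] = ""

# keep only the rows that still have a response; copy() so later column assignments act on an independent dataframe
df = df[df['text'] != ''].copy()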

### AUTOLAB_IGNORE_START
#print(df)
### AUTOLAB_IGNORE_STOP

This step eliminated over 2000 rows in the data, which increases the likelihood of obtaining insights that are more detailed and actionable.
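
Before settling on a cutoff, it can help to see how many responses survive at a few candidate lengths. The sketch below does this for a handful of made-up responses; the helper `count_kept` and the thresholds tried are assumptions chosen for illustration.

```python
# hypothetical cleaned responses, for illustration only
responses = [
    "thanks",
    "good course",
    "please add subtitles in spanish",
    "teachers should speak more slowly",
    "",
]

def count_kept(texts, min_chars):
    """Count responses with at least min_chars characters."""
    return sum(1 for t in texts if len(t) >= min_chars)

# number of responses kept at each candidate minimum length
kept_by_threshold = {n: count_kept(responses, n) for n in (1, 11, 21, 31)}
print(kept_by_threshold)
```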

## Processing Text Corpus with Multiple Languages

Given that our respondents are non-native English speakers, it is very likely that some of their responses are in languages other than English. I found two Python libraries that help here: one detects the language of a text, and the other translates it. The [langdetect](https://pypi.python.org/pypi/langdetect) library, a Python port of the Java language-detection library, reads a string and returns a code for the likely language of the text (e.g., "en" for English). This library can currently detect 55 languages. To install it, run "pip install langdetect" in your Anaconda shell (I did not find a conda install equivalent). In this next step, I am going to create a new field in my dataframe called "text_language" that holds the detected language of the text.

In [None]:
#import the library
from langdetect import detect, DetectorFactory

#langdetect is non-deterministic by default; fix the seed so results are reproducible
DetectorFactory.seed = 0

#create a new column in the dataframe
df['text_language'] = None

#iterate through the dataframe, detect the language and save it.
for index, row in df.iterrows():
    text = str(row['text'])

    #detect the language and write it back with df.at (rows from iterrows() may be copies)
    df.at[index, 'text_language'] = detect(text)

In [None]:
### AUTOLAB_IGNORE_START
#print(df)
### AUTOLAB_IGNORE_STOP 

To get a general picture of the language variety in our corpus, we will do a language count in the entire dataframe:

In [None]:
#create an empty dictionary to store the language counts
language_count = {}
english_count = 0
other_language_count = 0

#iterate through the dataframe, and count the number of languages the tool detected. 
for index, row in df.iterrows():
    text = str(row['text_language'])
    
    if text in language_count:
        language_count[text] = language_count[text] + 1
    else:
        language_count[text] = 1
    
    if text == "en":
        english_count += 1
    else:
        other_language_count += 1
    


In [None]:
### AUTOLAB_IGNORE_START
print(language_count)
print("\n\nEnglish: " + str(english_count))
print("Other Languages: "+ str(other_language_count))
df.to_csv("cleaned_dataset.csv", sep=',', encoding='utf-8')
### AUTOLAB_IGNORE_STOP 
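
The same tally can be written more compactly with `collections.Counter` from the standard library. This is a sketch on a made-up list of language codes rather than the survey data.

```python
from collections import Counter

# hypothetical detected language codes, for illustration only
detected = ["en", "es", "en", "pt", "es", "en", "ar"]

# count how often each language code occurs
language_count = Counter(detected)
english_count = language_count["en"]
other_language_count = sum(language_count.values()) - english_count

# most_common() lists the languages from most to least frequent
print(language_count.most_common())
print("English:", english_count)
print("Other Languages:", other_language_count)
```

With the dataframe from this tutorial, `Counter(df['text_language'])` or `df['text_language'].value_counts()` should give an equivalent count.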

Altogether, this library shows that there are 392 text responses that need to be translated. Potential solutions to this problem include finding native speakers for all languages, putting all the responses on Mechanical Turk and having crowd workers translate them, copying and pasting each one into Google Translate, or finding a Python library that can automatically provide translations (which thankfully exists).

The [googletrans](https://pypi.python.org/pypi/googletrans) library, an unofficial Python wrapper around Google Translate, allows for automatic translation if you provide a source and target language. If you do not specify a source language, it tries to detect the source language before it translates the text. To install, run "pip install googletrans" in your conda shell. The library can be used as follows:


In [None]:
#import and initialize translator
from googletrans import Translator
translator = Translator()

#first, save the original survey responses before translation
df['old_text'] = df['text']

#then create a column to record whether each response gets translated
df['translated'] = None

original = 0
translated = 0

#iterate through the dataframe; if the text is not English, translate it.
#results are written back with df.at because rows from iterrows() may be copies

for index, row in df.iterrows():
    lang = str(row['text_language'])

    txt = str(row['text'])

    if lang == "en":
        original += 1
        df.at[index, 'translated'] = "Original"
    else:
        new_text = translator.translate(txt, src=lang, dest='en').text
        #accept the translation only if the result is detected as English
        if detect(new_text) == 'en':
            translated += 1
            df.at[index, 'translated'] = "Translated"
            df.at[index, 'text'] = new_text
        else:
            #otherwise keep the original text and treat it as English
            original += 1
            df.at[index, 'translated'] = "Original"
            df.at[index, 'text_language'] = 'en'

In [None]:
### AUTOLAB_IGNORE_START
print("\n\nOriginal Count: " + str(original))
print("Translated Count: "+ str(translated))
df.to_csv("cleaned_dataset.csv", sep=',', encoding='utf-8')

df2 = df[df['translated'] == "Translated"]

for index, row in df2.iterrows():
    print(str(row['old_text']), ' -> ', str(row['text']))
### AUTOLAB_IGNORE_STOP 
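
The accept-or-reject rule used in the translation loop can be isolated into a small function and tried without network access. In this sketch, `translate_if_needed`, `fake_translate`, and `fake_detect` are hypothetical stand-ins for illustration; they are not part of googletrans or langdetect.

```python
def translate_if_needed(text, lang, translate_fn, detect_fn):
    """Return (text, status), keeping a translation only if it is detected as English."""
    if lang == "en":
        return text, "Original"
    candidate = translate_fn(text, lang)
    if detect_fn(candidate) == "en":
        return candidate, "Translated"
    # translation did not come back as English: keep the original text
    return text, "Original"

# stand-in functions so the sketch runs offline (assumptions, not real APIs)
fake_translations = {"necesito subtitulos": "i need subtitles"}

def fake_translate(text, lang):
    return fake_translations.get(text, text)

def fake_detect(text):
    return "en" if text in fake_translations.values() else "es"

accepted = translate_if_needed("necesito subtitulos", "es", fake_translate, fake_detect)
rejected = translate_if_needed("hola amigos", "es", fake_translate, fake_detect)
untouched = translate_if_needed("speak slowly", "en", fake_translate, fake_detect)
print(accepted, rejected, untouched)
```

In the tutorial's setting, the two stand-ins would be replaced by calls to the translator and to `detect`.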

Reviewing the results above, it looks like responses in other languages not only got translated, but a few English responses with typos also got "translated" to fix the typos. This almost makes me want to run the entire text corpus through the translator, since most of the respondents are non-native English speakers and most of the corpus is therefore likely to contain typos. Otherwise, the results are very impressive. For the rest of this tutorial, I am going to use some of the free text mining tools that were covered in class to try to gain insights into the design recommendations that the students shared.

## Free Text Analysis

In Homework 3, we covered different methods of analyzing free text data, including the bag-of-words approach, n-grams, cosine similarity, perplexity, etc. A simple bag-of-words approach is not useful in this case because, regardless of word frequency, we cannot extract insights from the corpus with individual words. Instead, I wanted to extract the common topics that emerged from the corpus to gain insights into the interventions that are potentially helpful.

I decided to use a topic extraction algorithm to get more comprehensible insights from the data. After some research, I found that the Python library scikit-learn (which we have used several times in past assignments) has two different topic modeling algorithms: Latent Dirichlet Allocation (LDA) and Non-negative Matrix Factorization (NMF).

For a nice summary on how these work, please visit this blog [post](https://medium.com/mlreview/topic-modeling-with-scikit-learn-e80d33668730). 

Although both algorithms perform topic modeling, NMF has been reported to produce more useful insights on smaller data sets. Both require that you specify the number of topics to extract, and you also choose how many top words to display per topic. I found that 5 words per topic and 30 topics in total produced meaningful results without being extremely repetitive for my particular text. The section below shows how I used scikit-learn's LDA implementation to extract the topics that were common in the corpus.


In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer 
from sklearn.decomposition import NMF, LatentDirichletAllocation

def preprocessing():
    
    #remove English stop words when building the word counts
    tf_vectorizer = CountVectorizer(stop_words='english')
    
    tf = tf_vectorizer.fit_transform(df["text"].tolist())
    #get_feature_names() was removed in recent scikit-learn releases; get_feature_names_out() replaces it
    tf_feature_names = tf_vectorizer.get_feature_names_out()

    #set number of topics to 30 total, and number of top words that make up a topic to 5. Play with these numbers to see different results
    no_topics = 30
    no_top_words = 5

    #build the topic model with the parameters and your text corpus
    lda = LatentDirichletAllocation(n_components=no_topics, max_iter=5, learning_method='online', learning_offset=50.,random_state=0).fit(tf)

    #return the model and its features
    return lda, tf_feature_names, no_top_words

In [None]:
model, feature_names, no_top_words = preprocessing()

### AUTOLAB_IGNORE_START

#print out each topic produced by the model
for topic_idx, topic in enumerate(model.components_):
    print("Topic %d:" % (topic_idx))
    print(" ".join([feature_names[i] for i in topic.argsort()[:-no_top_words - 1:-1]]))
    
### AUTOLAB_IGNORE_STOP 

It is almost impossible to manually go through a really large set of text responses and infer their meaning without a very large team and a well-defined coding dictionary. Topic modeling algorithms are extremely important in data science because they give us at least some insight into the kinds of topics that might emerge, and can help with the creation of a coding dictionary before the entire corpus is distributed to the team for analysis.

Some of the topics that the model produced are more insightful than others. The most obvious things I learned from this model are as follows:

1. A lot of these students express appreciation and compliments for the courses as they are (Topic 1)
2. Students say they can benefit from instructors speaking more slowly and enunciating clearly (Topics 2, 22)
3. They can benefit from clearer and more practical instructions and examples (Topic 8)
4. There is a strongly expressed need for better translators and dictionaries (Topics 11, 12, 17, 19)
5. Language subtitles are beneficial (Topic 14)
6. They can benefit from an increase in face-to-face interactions, in person or through video tools (Topics 24, 25) - there is already research evidence for this. Please see the [Kulkarni et al. Talkabout](https://hci.stanford.edu/publications/2015/PeerStudio/cscw237-kulkarni.pdf) system.
7. There is room for other kinds of multimedia instruction (Topic 18)
8. They can benefit from conversation with native English speakers (Topic 27)
