## Lab 6: Text Analysis and Natural Language Processing 

In this lab, we explore the text data provided by Kiva's API. Our primary source of textual data is the descriptive texts that borrowers submit for a loan request and are posted publicly on the Kiva website. Kiva is unique in that often, borrowers do not write descriptive requests for themselves, but fill out a questionnaire to Kiva's team of volunteer translators. We try to leverage this body of text (also called a *"corpus"*) to see if we can see any patterns in how an individual translator writes a description.

As always, we first import our packages and read in our data below. 

In [1]:
import pandas as pd
import numpy as np

# NLP-specific packages: 
import nltk
from nltk.corpus import names
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
import gensim
from gensim import corpora
from nltk.corpus import names
from nltk.tokenize import word_tokenize
from nltk.text import Text  
from nltk.stem import PorterStemmer


# output of multiple commands in a cell will be output at once.
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# display up to 80 columns, this keeps everything visible
pd.set_option('display.max_columns', 80)
pd.set_option('expand_frame_repr', True)

Slow version of gensim.models.doc2vec is being used


In [2]:
#datapath = '~/intro_to_machine_learning/data'
datapath = '~/Desktop'
df = pd.read_csv(datapath+'/df.csv', low_memory=False)

## Exploratory Analysis and Feature Engineering

We have very limited information about translators. In fact, the only variable in our dataset relevant to translators is their name! What information can we extract from this field? 

In text analysis, a common simple task is how to categorize names by gender. We know, just in our daily knoweldge of English names, that names that end in -a are likely to be female, and names that end in -o are likely to be male (for example, Jenna and Pablo). Since we have both the gender data and the name data for the borrowers, let's use borrowers' data to train a classifier model that can predict the gender from a name! Then, we will apply this model to the translators names to predict their genders. 

Here, we use the Naive Bayes Classifier (for a comprehensive review, take a look back at Module 6.) This algorithm assigns a label (in our case, "male" or "female") using the last letter of the name provided in the data. Remember that we first need to clean our data to ensure that we are capturing the last letter of first names. 

In [3]:
#create name and gender dataframe for single borrowers
kiva_names = df[['name', 'gender', 'borrower_count']]
kiva_names = kiva_names[['name', 'gender']][kiva_names['borrower_count'] == 1]

kiva_names.sample(15)
len(kiva_names)

Unnamed: 0,name,gender
65751,Kavumbi,Female
100521,Nancy,Female
19986,Daisy,Female
19661,Benson,Male
25753,Mwatime,Female
22788,Mkambe,Female
106072,Kyalo,Male
44531,Amina,Female
20833,Amani,Female
94110,Grace,Female


105297

Here we see there are some instances in which the name is not an individual's first name, but rather the name of a business or a collective, or "Anonymous". Let's drop these out of our training dataset as they won't be helpful in determining the gender of a person. 

Let's also select only the first name. 

In [5]:
# rm null values, anonymous, and duplicates

kiva_names = kiva_names.loc[kiva_names['name'].isnull() == False]
kiva_names = kiva_names.drop_duplicates()
kiva_names = kiva_names[kiva_names['name'] != "Anonymous"]
kiva_names['name'] = kiva_names['name'].str.split(expand=True)[0]

len(kiva_names['name'])
kiva_names['name'].head(15)

9794

0     Evaline
1      Julias
2        Rose
3        Jane
4       Alice
5       Clare
6        Mary
7       James
8     Jacinta
9       Emily
10     Fridah
11    Charity
12      Susan
13      Joyce
14     Daniel
Name: name, dtype: object

Now let's define a function that will return the last letter of our borrowers' first names. This letter will be a **feature** we will use to attempt to predict the output feature, gender. 

In [6]:
#function that returns last letter of first name 
def gender_features(name):
    return {'last_letter': name[-1]}

Now let's prepare to train our model. We split train and test sets as usual. 

In [7]:
# Set training-test split %
split_pct = 0.80

# Remove null and NaN values 
kiva_names = kiva_names[pd.notnull(kiva_names)]

# the pandas command "sample" already randomizes its selection. 
kiva_names_shuffled = kiva_names.sample(frac=1)

kiva_train_set = kiva_names_shuffled[:int((len(kiva_names_shuffled)*split_pct))] 
kiva_test_set = kiva_names_shuffled[int(len(kiva_names_shuffled)*split_pct+1):]  

len(kiva_train_set.index)
len(kiva_test_set.index)

7835

1958

Now we prepare our data by converting the name and gender features from features into lists, so they are associated with each other. 

In [8]:
kiva_female_train = kiva_train_set[kiva_train_set['gender'] == "Female"]
kiva_male_train = kiva_train_set[kiva_train_set['gender'] == "Male"]
kiva_female_test = kiva_test_set[kiva_test_set['gender'] == "Female"]
kiva_male_test = kiva_test_set[kiva_test_set['gender'] == "Male"]

kiva_train_feature_set = [(name, "female") for name in kiva_female_train['name']] + \
[(name, "male") for name in kiva_male_train['name']]

kiva_test_feature_set = [(name, "female") for name in kiva_female_test['name']] + \
[(name, "male") for name in kiva_male_test['name']]

In [9]:
kiva_train_feature_set = [(gender_features(n), g) for (n, g) in kiva_train_feature_set]
kiva_test_feature_set = [(gender_features(n), g) for (n, g) in kiva_test_feature_set]

In [10]:
kiva_classifier = nltk.NaiveBayesClassifier.train(kiva_train_feature_set)

In [11]:
#let's test out our new classifier! 

kiva_classifier.classify(gender_features('Cleopatra'))
kiva_classifier.classify(gender_features('Maximillian'))
kiva_classifier.classify(gender_features('James'))

'female'

'male'

'male'

It looks like it works okay for our three samples, but let's get a better sense of overall accuracy.

The nltk "accuracy()" method returns the % of time our predictions are accurate

In [12]:
#Find out which features were most informative in determining outcome

kiva_classifier.show_most_informative_features(5)

Most Informative Features
             last_letter = 'w'              male : female =      9.4 : 1.0
             last_letter = 'k'              male : female =      8.9 : 1.0
             last_letter = 'f'              male : female =      8.3 : 1.0
             last_letter = 'p'              male : female =      3.9 : 1.0
             last_letter = 's'              male : female =      3.5 : 1.0


Show most informative features: this returns LIKELIHOOD RATIOS. For the first entry "f", we see that males are more likely to have this letter as their last letter by a factor of 9.3x.

But how accurate is this? Let's run this classifier on our test dataset. 

In [13]:
#Get a sense of overall accuracy

print(nltk.classify.accuracy(kiva_classifier, kiva_test_feature_set))

0.6705822267620021


This prediction is okay, but not amazing. Remember that a random generator of genders would likely get an accuracy of about 50%, so at least we are better than random. One potential hypothesis for why we are not better at classifying genders might be because this particular dataset mixes Kenyan and American first names. Whereas you might expect an American female name to end in -a and an American male name to end in -o (e.g. Jenna and Julio), these conventions do not necessarily hold for Kenyan names. 

Since we see that the translators have primarily American names, let's try training a model using a corpus of American names.  

In [14]:
nltk_labeled_names = ([(name, "male") for name in names.words("male.txt")] +
                [(name, "female") for name in names.words("female.txt")])

nltk_feature_sets = [(gender_features(n), gender)
                for (n, gender) in nltk_labeled_names]

    # Divide the feature sets into training and test sets
nltk_train_set, nltk_test_set = nltk_feature_sets[500:], nltk_feature_sets[:500]

    # Train the naiveBayes classifier
nltk_classifier = nltk.NaiveBayesClassifier.train(nltk_train_set)

    # Test out the classifier with few samples outside of training set
print(nltk_classifier.classify(gender_features("neo")))  # returns male
print(nltk_classifier.classify(gender_features("trinity")))  # returns female

    # Test the accuracy of the classifier on the test data
print(nltk.classify.accuracy(nltk_classifier, nltk_test_set)) 

    # examine classifier to determine which feature is most effective for
    # distinguishing the name's gender
print(nltk_classifier.show_most_informative_features(5))

male
female
0.602
Most Informative Features
             last_letter = 'a'            female : male   =     35.5 : 1.0
             last_letter = 'k'              male : female =     34.1 : 1.0
             last_letter = 'f'              male : female =     15.9 : 1.0
             last_letter = 'p'              male : female =     13.5 : 1.0
             last_letter = 'v'              male : female =     12.7 : 1.0
None


Interestingly, our model using the Kiva data gets a slightly higher accuracy score. Let's use this model instead to try to predict translators' genders. 

In [15]:
translators = pd.DataFrame()
translators['translator_first_name'] = df['translator.byline'].str.split(expand=True)[0]

# rm null values and duplicates
translators = translators.loc[translators['translator_first_name'].isnull() == False]
translators = translators.drop_duplicates()

translators.head(5)

Unnamed: 0,translator_first_name
0,Julie
1,Morena
8,Lynn
19,Mohammad
21,Cheryl


In [16]:
translators['last_letter'] = translators['translator_first_name'].apply(lambda x: gender_features(x))
translators_last = translators['last_letter']
translators_last[0:5]

0     {'last_letter': 'e'}
1     {'last_letter': 'a'}
8     {'last_letter': 'n'}
19    {'last_letter': 'd'}
21    {'last_letter': 'l'}
Name: last_letter, dtype: object

In [17]:
translators['gender'] = translators_last.apply(lambda x: kiva_classifier.classify(x))
translators.head(10)

Unnamed: 0,translator_first_name,last_letter,gender
0,Julie,{'last_letter': 'e'},female
1,Morena,{'last_letter': 'a'},female
8,Lynn,{'last_letter': 'n'},male
19,Mohammad,{'last_letter': 'd'},male
21,Cheryl,{'last_letter': 'l'},male
23,Rita,{'last_letter': 'a'},female
25,Maureen,{'last_letter': 'n'},male
29,Lorne,{'last_letter': 'e'},female
31,Caty,{'last_letter': 'y'},female
34,Trishna,{'last_letter': 'a'},female


Interesting - even in this small sample of 10, we see that the accuracy rate is far from perfect. Using our own understanding of what gender we would assign the names we see, this sample has an accuracy score of 60%. Not great.  

**How can we make this prediction better? Can you think of other aspects of a name might be predictive of gender?** 
A quick test we can try is using the final two letters of a name instead of just one. Try it! 

We just completed our first supervised learning exercise: classification. Let's move forward in our question to finding patterns in the descriptions of the loans by translators, our unsupervised learning exercise. First we need to clean the text data: 

## Cleaning text 

Cleaning text is almost always required in text analysis. You have already gotten a taste of this in this notebook when you cleaned the variable "name" to exclude business names, and in past notebooks as well. 

Cleaning can be as extensive as you want it to be, depending on what serves your research question the best. Is it best to look at full sentences, so you can retain the context of words? Is it best to look at individual words? Should you remove grammar, HTML code, stopwords? 

Before answering this question, we have to know what's in our data. Let's turn to some exploratory analyses to determine how we should clean our data.

Note that we don't run the following snippets of code on the whole dataset as text analysis is very computationally expensive and may crash your computer. Instead, we draw a sample of 100 descriptions from the dataset. *This means that your results will look slightly different, but that's okay -- make sure to post on Slack anything you find interesting!*  

In [94]:
# read all non-null text into a single df
text_raw = df['description.texts.en'][df['description.texts.en'].isnull() == False]

# take sample of 100 entries, read into list
sample_num = 100
text_raw_abridged = text_raw.sample(sample_num)
text = list(map(str, text_raw_abridged))

print(text[0:3]) # Each sentence is an item in the list

['Mercy is a widow and blessed with three children who are still in school.  She runs a hardware shop to support her  family. She has been in this business for three years.  She also runs a green grocer to earn extra income.  She has employed one person to help her manage the business.  \r\r\n\r\r\nMercy is requesting for a loan of 20,000 Kenyan shillings to buy paints and bolts for resale. In the next 5 years, she wants to expand her business.', 'Jackson is the father of seven between the ages of 23 and 6 and lives with his family in his hometown area of Nyamira, North Rift, Kenya. He has two farmhands and his farm produces tea, coffee, bananas and maize. He makes additional income from his trade in animals within the local market.\r\r\n\r\r\nJackson has requested a loan of 35,000 KES from Juhudi Kilimo to purchase and insure a dairy cow. He says he will use the income to pay for the education of his children and raise their living standards. He plans to own five cows and construct a 

We see there is some HTML/CSS cluttering up the text. Below, we remove these and convert all capital letters to lowercase.

In [49]:
# Remove HTML 
text = [w.replace('\r', '') for w in text]
text = [w.replace('\n', '') for w in text]
text = [w.replace('<br />', '') for w in text]
text = [w.replace('.', '') for w in text]
text = [w.replace(',', '') for w in text]

# Lowercase
text = [w.lower() for w in text]

print(text[0:3])

['mercy is a widow and blessed with three children who are still in school  she runs a hardware shop to support her  family she has been in this business for three years  she also runs a green grocer to earn extra income  she has employed one person to help her manage the business  mercy is requesting for a loan of 20000 kenyan shillings to buy paints and bolts for resale in the next 5 years she wants to expand her business', 'jackson is the father of seven between the ages of 23 and 6 and lives with his family in his hometown area of nyamira north rift kenya he has two farmhands and his farm produces tea coffee bananas and maize he makes additional income from his trade in animals within the local marketjackson has requested a loan of 35000 kes from juhudi kilimo to purchase and insure a dairy cow he says he will use the income to pay for the education of his children and raise their living standards he plans to own five cows and construct a modern dairy unit for his animalsjackson is

Great! The text looks clean. We also notice that this dataset is a list where every item in the list is a description. Now we tokenize each item in the list so that each word is separated out. This yields a list of lists. 

In [50]:
tokens = list(map(word_tokenize, text))
kiva_text = nltk.Text(tokens)
kiva_text[0:2]

[['mercy',
  'is',
  'a',
  'widow',
  'and',
  'blessed',
  'with',
  'three',
  'children',
  'who',
  'are',
  'still',
  'in',
  'school',
  'she',
  'runs',
  'a',
  'hardware',
  'shop',
  'to',
  'support',
  'her',
  'family',
  'she',
  'has',
  'been',
  'in',
  'this',
  'business',
  'for',
  'three',
  'years',
  'she',
  'also',
  'runs',
  'a',
  'green',
  'grocer',
  'to',
  'earn',
  'extra',
  'income',
  'she',
  'has',
  'employed',
  'one',
  'person',
  'to',
  'help',
  'her',
  'manage',
  'the',
  'business',
  'mercy',
  'is',
  'requesting',
  'for',
  'a',
  'loan',
  'of',
  '20000',
  'kenyan',
  'shillings',
  'to',
  'buy',
  'paints',
  'and',
  'bolts',
  'for',
  'resale',
  'in',
  'the',
  'next',
  '5',
  'years',
  'she',
  'wants',
  'to',
  'expand',
  'her',
  'business'],
 ['jackson',
  'is',
  'the',
  'father',
  'of',
  'seven',
  'between',
  'the',
  'ages',
  'of',
  '23',
  'and',
  '6',
  'and',
  'lives',
  'with',
  'his',
  'family

## Preliminary investigations / visualizations 

Now that we've got cleaned data, let's conduct some preliminary investigations. Frequency, concordance and similar are all functions of the NLTK package that can give us a sense of what is in our text without our having to read every single line.

- Frequency
- Concordance
- Similar 

Frequency returns a list of unique words, with how often each word shows up in the corpus. This provides an idea of what words are included in the descriptions of loan requests in Kenya. Note that the most common words are relatively uninformative, such as "to," "and," or "is." Later we will remove these for analysis so they do not overinfluence our results. 

In [51]:
# Read all sentences into single list 

text_corpus = list() 

for x in range(0, len(kiva_text)): 
    text_corpus.extend(kiva_text[x])

text_corpus = nltk.Text(text_corpus)

In [52]:
#kiva_fdist.plot()
#kiva_fdist.plot(50, cumulative=True)
kiva_fdist = nltk.FreqDist(text_corpus)
kiva_fdist.most_common(25)

[('to', 492),
 ('and', 432),
 ('a', 388),
 ('the', 375),
 ('she', 337),
 ('is', 303),
 ('her', 297),
 ('of', 294),
 ('in', 228),
 ('for', 194),
 ('has', 186),
 ('business', 183),
 ('loan', 152),
 ('he', 151),
 ('his', 146),
 ('will', 136),
 ('years', 134),
 ('with', 117),
 ('this', 110),
 ('children', 102),
 ('that', 87),
 ('been', 85),
 ('from', 80),
 ('be', 70),
 ('married', 67)]

Concordance takes an input word of your choosing and returns the surrounding words. This provides important context about how a specific word is used in the text corpus. Here, we test "future", "seasonality", and "working". Note that sme of these words are used differently or ambiguously. This gets at an important point for NLP - words can be and are used ambiguously and it is difficult to parse meaning unless we also take a look at context.

In [23]:
text_corpus.concordance('future')

Displaying 25 of 29 matches:
ermanent home for his family in the future tabu is a married woman with three 
ithin 5 years she hopes that in the future , she will have improved living sta
ls for resale she hopes that in the future she will be successful this is her 
 dvd players , tv sets , etc in the future , he wants to educate his children 
ishing another business in the near future isaac does farming in thika town an
 term from visionfund kenya and his future hopes are to expand his farming bus
hool will help them have a brighter future lend $ 25 towards this loan and emp
ithin 5 years she hopes that in the future , she will live a comfortable life 
ess to open up a general shopin the future , he hopes to have a wholesale busi
usinessher hopes and dreams for the future are to have a good inventory and in
o be a clothes and shoe supplier in future with a five-year plan of opening mo
ithin 5 years she hopes that in the future , she will be a successful business
 profits to improve her

In [24]:
text_corpus.concordance('seasonality')

Displaying 5 of 5 matches:
e cites her major challenge to be seasonality mariam dreams of establishing a b
by she faces a major challenge of seasonality in her business she dreams to exp
ine’s major business challenge is seasonality she owns a house without electric
public her primary challenges are seasonality and lack of enough capital to ens
the challenges of competition and seasonality in her business with the kshs 30,


In [25]:
text_corpus.concordance('working')

Displaying 12 of 12 matches:
to start dairy farming he is a hard working man who is determined to achieve hi
hardworking individual she has been working alongside one acre fund since 2013 
 hardworking individual he has been working alongside one acre fund since 2014 
business challenge to be inadequate working capital she will use the kes 60,000
business challenge to be inadequate working capital she will use the 30,000 kes
ve five children : hillary , 26 and working ; lucy , 18 and in form two ; lilli
business challenge to be inadequate working capital she will use the kes 50,000
hant two years ago with the goal of working on her own to support her household
very resourceful person he has been working alongside one acre fund since 2015 
the people make their livelihood by working as laborers in the agricultural fie
business challenge to be inadequate working capital she will use the kes 60,000
strict < p > in 2010 hellen started working with the one acre fund she decided 


Similar takes in an input word of your choosing, but returns other words that appear in a similar range of contexts. This is called finding the "distributional similarity." Most similar words appear first. 

In [26]:
text_corpus.similar("future")

loan kshs school business father ages area kes purchase use education
loans part home challenge goal photo group training leader


In [27]:
text_corpus.similar("children")

business years farm neighbors family income home village community in
the loan own cows farming wife money educate married selling


### Remove stop words

"Stop words" are words like "to", "the", "a" - words that are plentiful but do not offer any meaningful information about the document. Here, we import a predetermined set of stop words defined by the NLTK package and then remove them from the dataset. The resulting dataset has words that we can generally agree are meaningful and say something about the content of the loan request. You can also define your own set of "stop words" to remove if you have a very specific set of words you want to remove. 

However, we see that these words still have suffixes such as "-s" and "-ing". We want to remove these because if we do not, the algorithm will count a set of words like "married" and "marries" as different words, when we can consider them, for our purposes, the same word. To remove these, we stem our text data. 

In [69]:
set(stopwords.words('english'))

{'a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 'd',
 'did',
 'didn',
 'do',
 'does',
 'doesn',
 'doing',
 'don',
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 'has',
 'hasn',
 'have',
 'haven',
 'having',
 'he',
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 'i',
 'if',
 'in',
 'into',
 'is',
 'isn',
 'it',
 'its',
 'itself',
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 'more',
 'most',
 'mustn',
 'my',
 'myself',
 'needn',
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'only',
 'or',
 'other',
 'our',
 'ours',
 'ourselves',
 'out',
 'over',
 'own',
 're',
 's',
 'same',
 'shan',
 'she',
 'should',
 'shouldn',
 'so',
 'some',
 'such',
 't',
 'than',
 'that',
 'the',
 'their',
 'theirs',
 'them',
 

In [53]:
#remove stop words

text_corpus_clean = [word for word in text_corpus if word not in stopwords.words('english')]
text_corpus_clean[0:50]

['mercy',
 'widow',
 'blessed',
 'three',
 'children',
 'still',
 'school',
 'runs',
 'hardware',
 'shop',
 'support',
 'family',
 'business',
 'three',
 'years',
 'also',
 'runs',
 'green',
 'grocer',
 'earn',
 'extra',
 'income',
 'employed',
 'one',
 'person',
 'help',
 'manage',
 'business',
 'mercy',
 'requesting',
 'loan',
 '20000',
 'kenyan',
 'shillings',
 'buy',
 'paints',
 'bolts',
 'resale',
 'next',
 '5',
 'years',
 'wants',
 'expand',
 'business',
 'jackson',
 'father',
 'seven',
 'ages',
 '23',
 '6']

### Stem words 

The Porter Stemmer is one of several stemming tools (including Snowball Stemmer and the Lancaster Stemmer). Each type of stemmer uses different rules to "stem" a word like "running" to "run". Here we use the Porter Stemmer as it is very commonly used. Try others! 

In [54]:
# Clean data - stem
# Porter stemmer is one of several

porter = nltk.PorterStemmer()
[porter.stem(t) for t in text_corpus_clean]

['merci',
 'widow',
 'bless',
 'three',
 'children',
 'still',
 'school',
 'run',
 'hardwar',
 'shop',
 'support',
 'famili',
 'busi',
 'three',
 'year',
 'also',
 'run',
 'green',
 'grocer',
 'earn',
 'extra',
 'incom',
 'employ',
 'one',
 'person',
 'help',
 'manag',
 'busi',
 'merci',
 'request',
 'loan',
 '20000',
 'kenyan',
 'shill',
 'buy',
 'paint',
 'bolt',
 'resal',
 'next',
 '5',
 'year',
 'want',
 'expand',
 'busi',
 'jackson',
 'father',
 'seven',
 'age',
 '23',
 '6',
 'live',
 'famili',
 'hometown',
 'area',
 'nyamira',
 'north',
 'rift',
 'kenya',
 'two',
 'farmhand',
 'farm',
 'produc',
 'tea',
 'coffe',
 'banana',
 'maiz',
 'make',
 'addit',
 'incom',
 'trade',
 'anim',
 'within',
 'local',
 'marketjackson',
 'request',
 'loan',
 '35000',
 'ke',
 'juhudi',
 'kilimo',
 'purchas',
 'insur',
 'dairi',
 'cow',
 'say',
 'use',
 'incom',
 'pay',
 'educ',
 'children',
 'rais',
 'live',
 'standard',
 'plan',
 'five',
 'cow',
 'construct',
 'modern',
 'dairi',
 'unit',
 'animals

## Algorithms: Latent Dirichlet Allocation

We've got a clean data set of text! Cleaned, tokenized, and stemmed. Now, let's try turning to our unsupervised model: Latent Dirichlet Allocation, which models topics in a document. 

The Latent Dirichlet Allocation model looks at text in a "bag of words" form, which is the simplest representation of text. Recall that "bag of words" means that all the words in a text corpus is counted and put into a dictionary. This is a convenient way to deal with text, but one downside is that no context is retained. In a "bag of words" representation, the sentence "the man ate bread" is considered the same as "the bread ate man". 

We will use the Python package "gensim" because this allows the model to be run on data that might exceed your machine's RAM. This will be important for your own NLP algorithms, as they are typically computationally expensive. 

In [91]:
kiva_text_clean = [None]*100

for i in range(0, len(kiva_text_clean)):
    test[i] = [word for word in kiva_text_clean[i] if word not in stopwords.words('english')]

In [95]:
#Clean data 

# initialize empty list, length = num of descriptions 
kiva_text_clean = [None]*sample_num
for i in range(0, sample_num):
    kiva_text_clean[i] = [word for word in kiva_text[i] if word not in stopwords.words('english')]

#[porter.stem(t) for t in kiva_text_clean]

# Creating the term dictionary of our corpus, where every unique term is assigned an index
dictionary = corpora.Dictionary(kiva_text_clean)

# Converting list of documents (corpus) into Document Term Matrix using dictionary prepared above.
doc_term_matrix = [dictionary.doc2bow(doc) for doc in kiva_text_clean]

In [100]:
# Creating the object for LDA model using gensim library
Lda = gensim.models.ldamodel.LdaModel

# Running and training LDA model on the document term matrix.
ldamodel = Lda(doc_term_matrix, num_topics=2, id2word = dictionary, passes=50)

In [101]:
print(ldamodel.print_topics(num_topics=2, num_words=5))

[(0, '0.017*"farming" + 0.015*"loan" + 0.010*"group" + 0.010*"income" + 0.009*"children"'), (1, '0.037*"business" + 0.023*"years" + 0.021*"loan" + 0.015*"children" + 0.012*"married"')]


Results: 
Farming loans / family business loans? 



**DEVNOTES** 
Ideas for research: 
Try to cluster description based on who the translator is? (need to explain what tf-idf is)
Try to parse out all adjectives - see what that looks like per translator ?

In [102]:
# Other clustering algos using nltk

help(nltk.cluster)

Help on package nltk.cluster in nltk:

NAME
    nltk.cluster

DESCRIPTION
    This module contains a number of basic clustering algorithms. Clustering
    describes the task of discovering groups of similar items with a large
    collection. It is also describe as unsupervised machine learning, as the data
    from which it learns is unannotated with class information, as is the case for
    supervised learning.  Annotated data is difficult and expensive to obtain in
    the quantities required for the majority of supervised learning algorithms.
    This problem, the knowledge acquisition bottleneck, is common to most natural
    language processing tasks, thus fueling the need for quality unsupervised
    approaches.
    
    This module contains a k-means clusterer, E-M clusterer and a group average
    agglomerative clusterer (GAAC). All these clusterers involve finding good
    cluster groupings for a set of vectors in multi-dimensional space.
    
    The K-means clusterer starts 