# a4 - Data Analysis
Fill in the code cells below as specified. Note that cells may use variables and functions defined in previous cells; we should be able to use the `Kernel > Restart & Clear All` menu item followed by `Cell > Run All` to execute your entire notebook and see the correct output.

## Part 1. Numbers
For this part of the assignment, you will analyze some numeric data (counts of library holdings) to investigate how the distribution of numbers in natural data sets obeys the counter-intuitive [Benford's Law](https://plus.maths.org/content/os/issue9/features/benford/index).

<small>(This exercise was adapted from Steve Wolfman).</small>

Create a variable **`holdings_data`** which is a **list** of the contents of the **`data/libraryholdings.txt`** file included in the repository (each line in the file should be a single element in the list). You will need to open up the file and read its contents into a list. You should specify a _local path_ to the file (from this notebook's location).

In [1]:
# Create holdings_data list as per the instructions
holdings_data = []

# Read the data from the file at path 'data/libraryholdings.txt'
with open('data/libraryholdings.txt') as holdings_file:
    for line in holdings_file:
        holdings_data.append(line)

Print out the first **ten** items from the `holdings_data` list, each on its own line. (Note that there may be extra line breaks that are included in the data items themselves).

In [2]:
# Print the first ten items from the holdings_data
for i in range(10):
    print(holdings_data[i], end = '')

(* Library holdings (# of books in each library), *)
(* collected by Christian Ayotte.                 *)
(* Labels not available.                          *)

12201
600778
14926
37863
14866
9896


Use the **slice operator (`:`)** to remove the "heading" and blank elements from the beginning of the data list, leaving only the list of numbers. The remaining values should continue to be stored (re-stored) in the `holdings_data` variable. Output the new first element in `holdings_data` to demonstrate that it is the first number in the data set.
- The values in the list _should_ be strings rather than integers

In [3]:
# Slice off the three heading lines and the blank line, then re-store the result
holdings_data = holdings_data[4:]

# Print the first element of holdings_data
print(holdings_data[0], end = '')

12201


Create a variable **`lead_digit_counts`** that is a dictionary whose keys are _strings_ of each digit (`"0"`, `"1"`, `"2"`, etc.), and whose values are all the number `0`. You can do this directly or with a loop. Print out the variable after you create it.

In [4]:
# Create a variable lead_digit_counts that is a dictionary
lead_digits_counts = {}

# Assign the values in the dictionary accordingly
for i in range(10):
    lead_digits_counts[str(i)] = 0
    
# Print the dictionary
print(lead_digits_counts)

{'0': 0, '1': 0, '2': 0, '3': 0, '4': 0, '5': 0, '6': 0, '7': 0, '8': 0, '9': 0}


Calculate the number of times each digit appears as the _first digit_ in a value of the `holdings_data` list, storing those counts in the `lead_digit_counts` dictionary.

In [5]:
# Tally the first character of each value in holdings_data
for value in holdings_data:
    first_digit = value[0]
    if first_digit in lead_digits_counts:    # Skip anything that does not start with a digit
        lead_digits_counts[first_digit] += 1
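
The same tally can also be sketched with `collections.Counter`. This is only an illustration, not the approach the assignment requires: `sample_values` and `sample_counts` are hypothetical names, and the sample holds just the first six numbers printed earlier rather than the full data set.

```python
from collections import Counter

# Hypothetical stand-in for holdings_data: the first six values, newlines included
sample_values = ["12201\n", "600778\n", "14926\n", "37863\n", "14866\n", "9896\n"]

# Start every digit at 0 so digits that never lead still appear in the result
sample_counts = {str(digit): 0 for digit in range(10)}

# Count the first character of each value, skipping anything that is not a digit key
leading = Counter(value[0] for value in sample_values if value[:1] in sample_counts)
sample_counts.update(leading)

print(sample_counts)
```

Pre-filling the dictionary keeps the zero counts that a bare `Counter` would omit.
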

Use a loop to print out each count in `lead_digit_counts` with the format:
```
X values have a leading digit of Y
```

In [6]:
# Loop to print out each count in lead_digits_counts
for k, v in lead_digits_counts.items():
    print(str(v) + " values have a leading digit of " + k)

0 values have a leading digit of 0
3056 values have a leading digit of 1
1606 values have a leading digit of 2
1018 values have a leading digit of 3
801 values have a leading digit of 4
640 values have a leading digit of 5
560 values have a leading digit of 6
502 values have a leading digit of 7
503 values have a leading digit of 8
452 values have a leading digit of 9


Print the _percentage_ of values in the library holdings data set that have a leading digit **`1`** (round to 2 decimal places). Is this value as predicted by Benford's law?

In [7]:
import math

# Print percentage of values in the library holdings data set that have a leading digit 1
print("Percentage of values in the library holdings data set that have a leading digit 1:")
print(round(((lead_digits_counts['1'] / len(holdings_data)) * 100), 2))

# Print the expected percentage as per Benford's Law
print("Benford's law prediction for values that have a leading digit 1:")
print(round((math.log((1 + 1) / 1, 10)) * 100, 2))

Percentage of values in the library holdings data set that have a leading digit 1:
33.44
Benford's law prediction for values that have a leading digit 1:
30.1


Approximately. The observed 33.44% is about 3.3 percentage points above the 30.1% that Benford's law predicts, so the `holdings_data` values follow the law reasonably closely, though not exactly.
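
To see how the other digits compare, the check can be extended to digits 1 through 9. The sketch below assumes the leading-digit counts printed earlier in this notebook and uses the usual statement of Benford's law, P(d) = log10(1 + 1/d); the names `observed_counts` and `comparison` are introduced here for illustration only.

```python
import math

# Leading-digit counts copied from the output printed earlier in this notebook
observed_counts = {"1": 3056, "2": 1606, "3": 1018, "4": 801, "5": 640,
                   "6": 560, "7": 502, "8": 503, "9": 452}
total = sum(observed_counts.values())

comparison = {}
for digit, count in observed_counts.items():
    # Observed share of values that start with this digit, as a percentage
    observed = round(count / total * 100, 2)
    # Benford's law prediction: P(d) = log10(1 + 1/d), as a percentage
    expected = round(math.log10(1 + 1 / int(digit)) * 100, 2)
    comparison[digit] = (observed, expected)
    print(digit, observed, expected)
```
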

***Extra credit challenge:*** Create a single variable `digit_position_counts` that contains the number of times that each digit 0 through 9 appears in _each_ position in the data set. E.g., a `1` appears in the 1st position 3056 times and in the second position 1005 times; a `2` appears in the 1st position 1606 times and in the second position 1044 times.

Use this variable to print a "table" of the percentage of the time each position contains each digit (e.g., the 1st digit is a `1` 33.44% of the time, a `2` 17.57% of the time, etc).

Note that for this extra challenge it is up to you to determine an appropriate data structure (e.g., how to combine dictionaries and lists and tuples) for representing this table. Be sure to include comments explaining your work.

Only attempt this problem once you have completed everything else!
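
One possible shape for this structure, offered as a sketch rather than a reference solution, is a list with one dictionary per digit position. The example assumes a hypothetical `sample_values` list holding only the first six numbers of the data set (newlines already stripped), so its percentages will not match the full-file figures quoted above.

```python
# Hypothetical sample: the first six numbers shown earlier, newlines stripped
sample_values = ["12201", "600778", "14926", "37863", "14866", "9896"]

# digit_position_counts[p] maps each digit string to how often it appears at position p
digit_position_counts = []
for value in sample_values:
    for position, digit in enumerate(value):
        # Grow the list the first time a number reaches a new position
        if position == len(digit_position_counts):
            digit_position_counts.append({str(d): 0 for d in range(10)})
        digit_position_counts[position][digit] += 1

# One row per position; percentages are relative to the values long enough to have that position
for position, counts in enumerate(digit_position_counts):
    total = sum(counts.values())
    row = [str(round(count / total * 100, 2)) for count in counts.values()]
    print("position " + str(position + 1) + ": " + " ".join(row))
```

A list indexed by position keeps the table rows in order, while the inner dictionaries reuse the same digit keys as `lead_digit_counts`.
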

## Part 2. Life Expectancy
For this part of the assignment, you'll work with data about the life expectancy (in years) for each country in the world in the years 1960 and 2013. Note that this can be really [fun](http://www.ted.com/talks/hans_rosling_shows_the_best_stats_you_ve_ever_seen.html) data!

The data is found in a [.csv](https://en.wikipedia.org/wiki/Comma-separated_values) file: a plain-text data format where each line represents a record (row) of data and where each feature (column) is separated by a comma.

Read in the contents of the **`data/life_expectancy.csv`** data file, and use it to construct a **list** called **`life_expectancy_list`**. Each element in this list should be a **dictionary** (one for each row in the `csv` file) with the following keys and values:

- a key `'country'` whose value is the name of the country (as a string)
- a key `'le_1960'` whose value is the life expectancy in 1960 (as a float)
- a key `'le_2013'` whose value is the life expectancy in 2013 (as a float)

Thus the first record should look like:
```
{'country': 'Aruba', 'le_1960': 65.56936585, 'le_2013': 75.33217073}
```

You should use the **`csv`** module to read this file and break up each row into different values. See [the documentation](https://docs.python.org/3/library/csv.html) for an example of how to do this. Print out the _first row_ of your list as a demonstration that you've processed the data correctly.

In [8]:
# Import the 'csv' module
import csv

# Create the empty list - 'life_expectancy_list'
life_expectancy_list = []

# Read the 'data/life_expectancy.csv' and populate the list
with open('data/life_expectancy.csv') as life_expectancy:
    reader = csv.DictReader(life_expectancy)
    for row in reader:
        # Append the list with values extracted from the .csv file
        life_expectancy_list.append({'country': row['country'], 'le_1960': float(row['le_1960']), 
                                     'le_2013': float(row['le_2013'])})
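
To check the parsing logic without touching the file, the same steps can be run on an in-memory string. This sketch assumes the header row is `country,le_1960,le_2013` (inferred from the keys used above) and uses only the Aruba record quoted in the instructions; `sample_csv` and `sample_rows` are hypothetical names.

```python
import csv
import io

# In-memory stand-in for the file; the header names are an assumption based on the keys above
sample_csv = "country,le_1960,le_2013\nAruba,65.56936585,75.33217073\n"

sample_rows = []
for row in csv.DictReader(io.StringIO(sample_csv)):
    # DictReader yields every field as a string, so convert the numeric columns
    sample_rows.append({
        'country': row['country'],
        'le_1960': float(row['le_1960']),
        'le_2013': float(row['le_2013']),
    })

print(sample_rows[0])
```
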

Add another item to each dictionary in the `life_expectancy_list` whose **key** is `change` and whose **value** is the change in life expectancy from 1960 to 2013.

In [9]:
# Loop through the list to create the 'change' key for each dictionary in 'life_expectancy_list'
for record in life_expectancy_list:
    record['change'] = record['le_2013'] - record['le_1960']

Create a variable **`num_small_gain`** that stores the **number of countries** whose life expectancy did not improve by 5 years or more between 1960 and 2013. This will include countries whose life expectancy has worsened. Print out this variable.

In [10]:
# Create num_small_gain variable as 0
num_small_gain = 0

# Loop through the list to count the countries whose 'change' is less than 5
for record in life_expectancy_list:
    if record['change'] < 5:
        # Increment the count where True
        num_small_gain += 1
        
# Print the number of countries whose life expectancy did not improve by 5 years or more between 1960 and 2013    
print(num_small_gain)

7


Create a variable **`most_improved`** that is the **name of the country** with the largest gain in life expectancy (between 1960 and 2013). Print out this variable.

In [11]:
# Find the record with the maximum 'change', then keep only its country name
most_improved = max(life_expectancy_list, key=lambda x: x['change'])['country']

# Print the name of the country with the largest gain in life expectancy (between 1960 and 2013)
print(most_improved)

Maldives


Define a function **`compare_country_le()`** that takes in the names of _two_ countries, and returns a **tuple** containing the following information:
- the name of the country with the greater life expectancy,
- the life expectancy in 2013 of that country
- the difference between the life expectancies in 2013

Use your function to print the comparison between the life expectancies of the _United States_ and _Cuba_.  

In [12]:
# Define the compare_country_le() function
def compare_country_le(country_one, country_two):
    """This function accepts the names of two countries, and returns a tuple containing the following information:

    - the name of the country with the greater life expectancy,
    - the life expectancy in 2013 of that country
    - the difference between the life expectancies in 2013"""
    
    life_expectancy_2013_one = 0
    life_expectancy_2013_two = 0
    
    # Loop through the life_expectancy_list to figure out the respective life expectancies
    for record in life_expectancy_list:
        if record['country'] == country_one:
            life_expectancy_2013_one = record['le_2013']
        if record['country'] == country_two:
            life_expectancy_2013_two = record['le_2013']
            
    # Check to figure out the country with higher life expectancy        
    if life_expectancy_2013_one > life_expectancy_2013_two:
        return (country_one, life_expectancy_2013_one, (life_expectancy_2013_one - life_expectancy_2013_two))
    else:
        return (country_two, life_expectancy_2013_two, (life_expectancy_2013_two - life_expectancy_2013_one))
    
# Print the outcome of the function when we pass 'United States' and 'Cuba'    
print(compare_country_le('United States', 'Cuba'))

('Cuba', 79.23926829, 0.39780487999999536)


## Part 3. Readability
For this part of the assignment, you will calculate the [readability](https://en.wikipedia.org/wiki/Readability) of a text document using the [Dale-Chall Readability Formula](http://www.readabilityformulas.com/new-dale-chall-readability-formula.php). This method determines how "easy" it is to read a particular (English) document by considering the length of sentences and how many of the words used are "easy" to understand (based on a pre-defined list of "easy" words).

Splitting real-world text documents into words and sentences is non-trivial (English is hard!). To make this easier, you should use the [Natural Language Toolkit (nltk)](http://www.nltk.org/index.html) module. This module is included with Anaconda, but does require some additional data source files to be installed on your computer. You _should_ be able to do this by running the cell below (you only need to run it once).

In [13]:
from nltk import download
download('punkt')
download('wordnet')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\prate\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\prate\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

You will also need to load the list of "easy" words into memory. This list can be found in the **`data/dale-chall.txt`** file. Open this file and read its entire contents into a **list** variable (e.g., `easy_words_list`), where each element in the list is a single line (word) in the file.

In [14]:
# Create the 'easy_words_list'
easy_words_list = []

# Read the 'data/dale-chall.txt' to extract the easy words
with open('data/dale-chall.txt') as dale_chall:
    for line in dale_chall:
        easy_words_list.append(line.strip())

In order to "look up" easy words, convert the easy words list into a **dictionary** (e.g., `easy_words_dict`), where each **key** is a word, and each **value** is `True` (that the word is in the list).
- Make sure you do not include newline characters in your keys!

In [15]:
# Create 'easy_words_dict' dictionary where each key is a word, and each value is True
easy_words_dict = dict.fromkeys(easy_words_list, True)

Additionally, define a dictionary **`readability_grade_dict`** to use for looking up the "grade level" associated with a readability score (see [this table](https://en.wikipedia.org/wiki/Dale%E2%80%93Chall_readability_formula#Formula)). This dictionary should have **keys** that are ___tuples___ giving the range of score for a particular grade (e.g., `(5.0, 5.9)`), and **values** that are ___strings___ representing the grade (e.g., `"5th or 6th grade"`). 

In [16]:
# Define the dictionary 'readability_grade_dict' to use for looking up the "grade level" associated with a readability score
readability_grade_dict = {
    (0.0, 4.9):'Grade 4th or below',
    (5.0, 5.9):'Grade 5th or 6th',
    (6.0, 6.9):'Grade 7th or 8th',
    (7.0, 7.9):'Grade 9th or 10th',
    (8.0, 8.9):'Grade 11th or 12th',
    (9.0, 9.9):'Grade 13th to 15th (college)'
}

Define a function **`print_grade()`** that takes in a readability score (a number greater than or equal to 0), and **prints** a string representing the grade associated with that score (from your `readability_grade_dict` dictionary).
- _Hint:_ loop through the items in the dictionary and determine which "tuple" key has elements that the score falls between. Be sure to round the score to one decimal place.

In [17]:
# Define the print_grade() function
def print_grade(read_score):
    """This function takes in a readability score (a number greater than or equal to 0), 
    and prints a string representing the grade associated with that score"""
    
    # Round once, then find the range that contains the rounded score
    rounded_score = round(read_score, 1)
    for k, v in readability_grade_dict.items():
        if k[0] <= rounded_score <= k[1]:
            print(v)
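
Because `print_grade()` prints rather than returns, it is awkward to test directly. As a sketch only, the same range lookup can be written as a function that returns the grade string; `grade_ranges` and `grade_for_score` are hypothetical names, and returning `None` for scores outside every range is an assumption, not part of the assignment.

```python
# Hypothetical copy of the lookup table so this sketch runs on its own
grade_ranges = {
    (0.0, 4.9): 'Grade 4th or below',
    (5.0, 5.9): 'Grade 5th or 6th',
    (6.0, 6.9): 'Grade 7th or 8th',
    (7.0, 7.9): 'Grade 9th or 10th',
    (8.0, 8.9): 'Grade 11th or 12th',
    (9.0, 9.9): 'Grade 13th to 15th (college)',
}

def grade_for_score(read_score):
    """Return the grade string for a score, or None if no range matches."""
    # Rounding to one decimal place avoids falling into the gaps between ranges
    rounded = round(read_score, 1)
    for (low, high), grade in grade_ranges.items():
        if low <= rounded <= high:
            return grade
    return None

print(grade_for_score(7.113))
```

Rounding first matters because the ranges leave gaps: a raw score of 5.96 lies between `(5.0, 5.9)` and `(6.0, 6.9)` until it is rounded to 6.0.
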

Now to calculate the readability scores! Define a function **`count_sentences()`** that counts the number of sentences in a string. Use the [sent_tokenize()](http://www.nltk.org/api/nltk.tokenize.html#nltk.tokenize.sent_tokenize) function from the `nltk.tokenize` module to break up a string into sentences (this is like the string `split()` function, but it splits into sentences rather than dividing by spaces).
- For help and an example, see [this guide](http://textminingonline.com/dive-into-nltk-part-ii-sentence-tokenize-and-word-tokenize).
- You do not need to do any extra processing beyond that provided by the `sent_tokenize()` function.
- Test your function on a simple pair or trio of sentences!

In [18]:
# Import 'nltk' module
import nltk

# Define the 'count_sentences' function
def count_sentences(text):
    """This function returns the number of sentences in a string"""
    
    # The parameter is named 'text' so that it does not shadow the built-in str
    return len(nltk.tokenize.sent_tokenize(text))

Define a function **`extract_words()`** that takes in a string and returns a _list_ of all of the words in that string. Use the [word_tokenize()](http://www.nltk.org/api/nltk.tokenize.html#nltk.tokenize.word_tokenize) function from the `nltk.tokenize` module to break up the string into words.
- The `nltk` tokenizer includes each punctuation character (e.g., commas, periods) as individual "words". Your list should not include these items. You can use a string method to determine whether or not the word starts with a punctuation symbol, and if so exclude it. _Hint:_ think about keeping good words, rather than throwing away the bad! Note that you do not need to give any special consideration to contractions or other words that include their own punctuation.
- Test your function on a simple sentence (with punctuation!).

In [19]:
# Import 'string' module
import string

# Define extract_words() function
def extract_words(text):
    """This function takes in a string and returns a list of all of the words in that string"""
    
    list_of_words = nltk.tokenize.word_tokenize(text)
    return_list = []
    
    # Keep only the tokens whose first character is a letter
    for word in list_of_words:
        if word[0].isalpha():
            return_list.append(word)
    
    # Return statement
    return return_list
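
The "keep the good words" filter can be tried without `nltk` on a hand-written token list. The tokens below are an assumption about what a tokenizer might return for a short sentence, not actual `word_tokenize()` output, and `sample_tokens` and `kept_words` are hypothetical names.

```python
# Hand-written tokens, assumed to resemble what a tokenizer might return
sample_tokens = ['Hello', ',', 'world', '!', 'It', "'s", 'fine', '.']

# Keep only tokens whose first character is a letter
kept_words = [token for token in sample_tokens if token[:1].isalpha()]

print(kept_words)
```

Slicing with `token[:1]` instead of indexing with `token[0]` simply avoids an error if an empty string ever appears in the list.
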

Define a function **`count_easy_words()`** that takes in a _list_ of words as an argument and returns the number of words that are "easy".

- Your function should look up each word in the `easy_words_dict` you defined earlier. _Do not look up words in the list_ (the dictionary is much faster!). Be careful to look up lowercase versions of the word.

- Your function should handle detecting different parts of speech (e.g., plurals, different verb conjugations, etc.). You can do this by using the **`WordNetLemmatizer()`** function from the `nltk.stem.wordnet` module&mdash;which produces a "lemmatizer" object. You can call the **`lemmatize()`** method on this object to reduce a word to its "base" form. See [this example](https://pythonprogramming.net/lemmatizing-nltk-tutorial/). Note that you should reduce words to both their basic noun AND verb forms (you will need to call the function twice: once with `'n'` (noun) and once with `'v'` (verb) as the second argument!)

- You can test your function on the word list: `['My','words','spoken','have','consequences']`, which should have 4 of the 5 words considered easy (not "consequences").

In [20]:
# Import the WordNetLemmatizer from nltk.stem.wordnet module
from nltk.stem.wordnet import WordNetLemmatizer

# Define the count_easy_words() function
def count_easy_words(ls_words):
    """This function takes in a list of words as an argument and returns the number of words that are 'easy'"""
    
    return_count = 0
    lemmatizer = WordNetLemmatizer()
    
    # Loop through each word in the list
    for word in ls_words:
        lowercase_word = word.lower()
        
        # Count the word if either its verb or its noun base form is easy
        if lemmatizer.lemmatize(lowercase_word, 'v') in easy_words_dict:
            return_count += 1
        elif lemmatizer.lemmatize(lowercase_word, 'n') in easy_words_dict:
            return_count += 1
    
    # Return statement
    return return_count

# Test the count_easy_words() function
print(count_easy_words(['My','words','spoken','have','consequences']))

4


Define a function **`calc_readability_score()`** that takes in a string of text and returns a readability "score" for the text based on the [Dale-Chall readability formula](https://en.wikipedia.org/wiki/Dale%E2%80%93Chall_readability_formula#Formula). Call your previous functions to calculate the number of sentences, total words, and number of difficult (not easy) words.
- Don't forget to adjust the score if the text is more than 5% difficult words!

In [21]:
# Define the calc_readability_score() function
def calc_readability_score(text):
    """This function takes in a string of text and returns a readability 'score'"""
    
    number_of_sentences = count_sentences(text)
    
    list_of_words = extract_words(text)
    number_of_words = len(list_of_words)
    
    number_of_easy_words = count_easy_words(list_of_words)
    difficult_words = number_of_words - number_of_easy_words
    
    # Percentage of difficult words and average sentence length (in words)
    percent_difficult = difficult_words / number_of_words * 100
    average_sentence_length = number_of_words / number_of_sentences
    
    # Calculate the readability score
    readability_score = (0.1579 * percent_difficult) + (0.0496 * average_sentence_length)
    
    # Adjust the readability score if percentage of difficult words is above 5%
    if percent_difficult > 5:
        readability_score += 3.6365
    
    # Return statement
    return readability_score

Read in the text of the `data/alice.txt` file (the full text of Alice in Wonderland) _as a single string_. 

In [22]:
# Read the data from alice.txt
with open('data/alice.txt', encoding="utf-8") as alice:
    alice_in_wonder = alice.read()

Calculate the readability score for the `alice.txt` file and print it out. Then print out the reading grade associated with that score. Use your previously-defined functions!
- For testing, note that my calculations show `alice.txt` has 977 sentences and 27198 words, of which 3610 are difficult. This leads to a readability score of ~7.113.

In [23]:
# Calculate the readability score
score_alice = calc_readability_score(alice_in_wonder)

# Print the score
print("Readability Score of 'alice.txt':", score_alice)

# Print the corresponding grade
print_grade(score_alice)

Readability Score of 'alice.txt': 7.113090902410715
Grade 9th or 10th
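
As a cross-check, the arithmetic can be reproduced by hand from the figures quoted in the instructions (977 sentences, 27198 words, 3610 difficult words). This sketch applies the same formula as `calc_readability_score()` directly to those numbers; the variable names are chosen here for illustration.

```python
# Figures quoted in the assignment text for alice.txt
num_sentences = 977
num_words = 27198
num_difficult = 3610

percent_difficult = num_difficult / num_words * 100   # about 13.27
average_sentence_length = num_words / num_sentences   # about 27.84

# Dale-Chall raw score, plus the adjustment when more than 5% of the words are difficult
score = 0.1579 * percent_difficult + 0.0496 * average_sentence_length
if percent_difficult > 5:
    score += 3.6365

print(round(score, 3))
```
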


_Note that this result may not be an especially accurate model of a text's readability&mdash;after all, it's just based on a simple estimation!_