# a5 - Data Analysis
Fill in the code cells below as specified. Note that cells may utilize variables and functions defined in previous cells; we should be able to use the `Kernel > Restart & Clear All` menu item followed by `Cell > Run All` to execute your entire notebook and see the correct output.

**IMPORTANT** You are <u>_not_</u> allowed to use comprehensions (a `for` loop inside of `[]` or `{}`) for this assignment. Use regular loops and control structures instead.

## Part 1. Numbers
For this part of the assignment, you will analyze some numeric data (counts of library holdings) to investigate how the distribution of numbers in natural data sets obeys the counter-intuitive [Benford's Law](https://plus.maths.org/content/os/issue9/features/benford/index). <small>(This exercise was adapted from Steve Wolfman).</small>

Create a variable **`holdings_data`** which is a **list** of the contents in the **`data/libraryholdings.txt`** file included in the repository (each line in the file will be a single element in the list). You will need to open up the file and read its contents into a list. You must specify a _local path_ to the file from this notebook's location.

In [2]:
with open("data/libraryholdings.txt") as datafile:
    holdings_data = []
    for line in datafile:
        holdings_data.append(line)

Print out the first **ten** items from the `holdings_data` list, each on its own line, in order to see what they are. (Note that there may be extra line breaks that are included in the data items themselves; you are welcome to `strip()` these off).

In [3]:
for i in range(10):
    # Strip the trailing line break so each item prints on a single line
    print(holdings_data[i].strip('\n'))

(* Library holdings (# of books in each library), *)
(* collected by Christian Ayotte.                 *)
(* Labels not available.                          *)

12201
600778
14926
37863
14866
9896


Use the **slice operator (`:`)** to remove the "heading" and blank elements from the beginning of the data list, leaving only the list of numbers. The remaining values should continue to be stored (so re-assigned) in the `holdings_data` variable. Print the new 0th element of `holdings_data` to demonstrate that it is the first number in the data set (12201).
- Note that the values in the list are supposed to be strings rather than integers; don't convert them!

In [4]:
holdings_data = holdings_data[4:]
print(holdings_data[0])

12201



Create a variable **`lead_digit_counts`** that is a dictionary whose keys are _strings_ of each digit (`"0"`, `"1"`, `"2"`, etc.), and whose values are all the number `0`. You can write out the literal or use a loop. Print out the variable after you create it.

In [5]:
lead_digit_counts = {"0": 0, "1": 0, "2": 0, "3": 0, "4": 0, "5": 0, "6": 0, "7": 0, "8": 0, "9": 0}

Calculate the number of times each digit appears as the _first digit_ in a value of the `holdings_data` list, storing those counts in your `lead_digit_counts` dictionary. _Hint:_ loop through all of the items in the data set and for each item add 1 to the appropriate value in the dictionary.

In [6]:
for number in holdings_data:
    lead_digit_counts[number[0]] += 1

Use a for loop to print out each count in `lead_digit_counts` with the format:
```
X values have a leading digit of digit Y
```

In [7]:
for digit, count in lead_digit_counts.items():
    # The count comes first and the digit last, matching the required format
    print(str(count) + " values have a leading digit of digit " + digit)

0 values have a leading digit of digit 0
3056 values have a leading digit of digit 1
1606 values have a leading digit of digit 2
1018 values have a leading digit of digit 3
801 values have a leading digit of digit 4
640 values have a leading digit of digit 5
560 values have a leading digit of digit 6
502 values have a leading digit of digit 7
503 values have a leading digit of digit 8
452 values have a leading digit of digit 9


Print the _percentage_ of values in the library holdings data set that have a leading digit **`1`** (rounded to 2 decimal places). Consider: does this value match what is predicted by [Benford's law](https://en.wikipedia.org/wiki/Benford%27s_law)?

In [8]:
sum_value = 0
for num, value in lead_digit_counts.items():
    sum_value += value
# Convert to a percentage first, then round to 2 decimal places
percentage = round(lead_digit_counts['1'] / sum_value * 100, 2)
print(str(percentage) + "%")

# Yes, roughly: Benford's law predicts that 1 appears as the leading significant digit about 30% of the time,
# and 1 is the leading digit about 33% of the time in this data set.

33.44%
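
For comparison, Benford's law gives the expected share of each leading digit `d` as `log10(1 + 1/d)`. The following is a minimal sketch of those predicted percentages; the variable name `benford_percentages` is illustrative and not part of the assignment.

```python
import math

# Benford's law: P(d) = log10(1 + 1/d) for leading digits d = 1..9.
# benford_percentages is an illustrative name, not an assignment variable.
benford_percentages = {}
for digit in range(1, 10):
    benford_percentages[str(digit)] = round(math.log10(1 + 1 / digit) * 100, 2)

for digit, percentage in benford_percentages.items():
    print(digit + ": " + str(percentage) + "%")
```

The predicted share for a leading `1` is about 30.1%, which can be set beside the 33.44% measured in the holdings data.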


***Extra credit challenge:*** _Only attempt this problem once you have completed everything else!_

Create a single variable `digit_position_counts` that contains the number of times that each digit 0 through 9 appears in _each_ position in the data set. E.g., a `1` appears in the 1st position 3056 times and in the second position 1005 times; a `2` appears in the 1st position 1606 times and in the second position 1044 times.

Use this variable to print a "table" of the percentage of the time each position contains each digit (e.g., the 1st digit is a `1` 33.44% of the time, a `2` 17.57% of the time, etc).

Note that for this extra challenge it is up to you to determine an appropriate data structure (e.g., how to combine dictionaries, lists, and tuples) for representing this table. Be sure to include comments (with `#`) explaining your work.
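
One possible structure, offered only as a sketch, is a list of dictionaries in which index 0 holds the counts for the 1st digit, index 1 the counts for the 2nd digit, and so on. The `sample_values` list below is an assumption for illustration (the first six numbers printed earlier); a full solution would loop over `holdings_data` instead.

```python
# Small sample taken from the first values printed earlier; the real
# analysis would loop over the full holdings_data list instead.
sample_values = ["12201", "600778", "14926", "37863", "14866", "9896"]

# One dictionary of digit counts per position: index 0 is the 1st digit.
digit_position_counts = []
for value in sample_values:
    value = value.strip()
    for position in range(len(value)):
        # Grow the list the first time a new position is seen.
        if position == len(digit_position_counts):
            new_counts = {}
            for digit in "0123456789":
                new_counts[digit] = 0
            digit_position_counts.append(new_counts)
        digit_position_counts[position][value[position]] += 1

# Percentage of sample values with each digit in the 1st position.
first_position = digit_position_counts[0]
total = 0
for count in first_position.values():
    total += count
for digit, count in first_position.items():
    print(digit + ": " + str(round(count / total * 100, 2)) + "%")
```

A list works here because positions are consecutive integers starting at 0; a dictionary keyed by position would serve equally well.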

## Part 2. Life Expectancy
For this part of the assignment, you'll work with data about the life expectancy (in years) for each country in the world in the years 1960 and 2013. Note that this can be really [fun](http://www.ted.com/talks/hans_rosling_shows_the_best_stats_you_ve_ever_seen.html) data!

The data is found in a [.csv](https://en.wikipedia.org/wiki/Comma-separated_values) file **`data/life_expectancy.csv`**.

Read in the contents of this data file, and use it to construct a **list** called **`life_expectancy_list`**. Each element in this list should be a **dictionary** (one for each row in the `csv` file) with the following keys and values:

- a key `'country'` whose value is the name of the country (as a string)
- a key `'le_1960'` whose value is the life expectancy in 1960 (as a float)
- a key `'le_2013'` whose value is the life expectancy in 2013 (as a float)

Thus the first record should look like:
```
{'country': 'Aruba', 'le_1960': 65.56936585, 'le_2013': 75.33217073}
```

You can use a for loop and the `split()` function to construct this list of dictionaries, similar to the process you've done in the last assignment. _Remember to convert the string values into floats using_ `float()`. Alternatively, you may instead use the **`csv.reader`** function from the `csv` module; see [the documentation](https://docs.python.org/3/library/csv.html#csv.reader) for an example.

Print out the _first row_ of your list as a demonstration that you've processed the data correctly.
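
The `split()` approach described above might look like the following sketch. It runs on a small in-memory sample rather than the real file: `sample_lines`, `sample_records`, and the placeholder column names are assumptions, with the country in column 0 and the two life expectancy values in columns 3 and 4.

```python
# Hypothetical sample shaped like the file; the header names and the two
# middle columns are placeholders, since only columns 0, 3 and 4 are used.
sample_lines = [
    "country,col_b,col_c,le_1960,le_2013\n",
    "Aruba,x,y,65.56936585,75.33217073\n",
]

sample_records = []
for line in sample_lines[1:]:  # the slice skips the header row
    fields = line.strip().split(",")
    record = {
        "country": fields[0],
        "le_1960": float(fields[3]),  # convert the strings to floats
        "le_2013": float(fields[4]),
    }
    sample_records.append(record)

print(sample_records[0])
```

Note that a plain `split(",")` mis-parses any field that itself contains a comma inside quotes, which is one reason to prefer `csv.reader` for real files.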

In [9]:
import csv

life_expectancy_list = []
with open("data/life_expectancy.csv", newline='') as lifedata:
    reader = csv.reader(lifedata, delimiter=',')
    next(reader)  # skip the header row
    for row in reader:
        # Convert the life expectancy columns from strings to floats
        record = {'country': row[0], 'le_1960': float(row[3]), 'le_2013': float(row[4])}
        life_expectancy_list.append(record)
print(life_expectancy_list[0])

{'country': 'Aruba', 'le_1960': 65.56936585, 'le_2013': 75.33217073}


Add another item to _each_ dictionary in the `life_expectancy_list` whose **key** is `change` and whose **value** is the change in life expectancy from 1960 to 2013. For example, the `change` value for Aruba should be about 9.76280488. Print out the _last_ element of the `life_expectancy_list` to check your work.
- _Hint:_ Use a for loop and modify each dictionary in the list!

In [10]:
for dic in life_expectancy_list:
    dic['change'] = dic['le_2013'] - dic['le_1960']
print(life_expectancy_list[-1])

{'country': 'Zimbabwe', 'le_1960': 51.54246341, 'le_2013': 59.7734878, 'change': 8.231024389999995}


Create a variable **`num_small_gain`** that stores the **number of countries** whose life expectancy did not improve by 5 years or more between 1960 and 2013. This will include countries whose life expectancy has worsened (whose change is negative). Print out the `num_small_gain` variable.
- _Hint:_ use a for loop to "tally" all of the relevant countries.

In [11]:
num_small_gain = 0
for dic in life_expectancy_list:
    if dic['change']<5:
        num_small_gain +=1
print(num_small_gain)

7


Define a function **`compare_country_le()`** that takes in the names of _two_ countries, and returns a **tuple** containing the following information (in order):
- the name of the country with the greater life expectancy,
- the life expectancy in 2013 of that country
- the difference between the life expectancies in 2013

Use your function to print the comparison between the life expectancies of the _United States_ and _Cuba_.  

In [12]:
def compare_country_le(country1, country2):
    # Look up the 2013 life expectancy of each country by name
    for dic in life_expectancy_list:
        if dic['country'] == country1:
            country1value = float(dic['le_2013'])
        if dic['country'] == country2:
            country2value = float(dic['le_2013'])
    diff = abs(country1value - country2value)
    if country1value > country2value:
        greater = country1
        life_expect = country1value
    else:
        greater = country2
        life_expect = country2value
    return (greater, life_expect, diff)

print(compare_country_le("United States", "Cuba"))

('Cuba', 79.23926829, 0.39780487999999536)


## Advanced Part 3. Readability

> This is an advanced, very challenging set of requirements. If you are intending to pursue the Data Science specialization, you should definitely complete this section. But if that is not your focus that's okay&mdash;your score on this assignment will not be penalized if this section is incomplete.
> But we encourage everyone to at least try it out and see how far you get, if only for the practice and experience.

For this part of the assignment, you will calculate the [readability](https://en.wikipedia.org/wiki/Readability) of a text document using the [Dale-Chall Readability Formula](http://www.readabilityformulas.com/new-dale-chall-readability-formula.php). This method determines how "easy" it is to read a particular (English) document by considering the length of sentences and how many of the words used are "easy" to understand (based on a pre-defined list of "easy" words).
- Note that this part of the assignment involves researching and using an additional set of modules. If you have any questions or get stuck, ask for help!

You will first need to load the list of "easy" words into memory. This list can be found in the **`data/dale-chall.txt`** file. Open this file and read its entire contents into a **list** variable (e.g., `easy_words_list`), where each element in the list is a single line (word) in the file.

Print out the _length_ of this list variable to check your work. It should have 2942 entries (words).

In [16]:
with open("data/dale-chall.txt") as file_easywords:
    easywords = []
    for line in file_easywords:
        easywords.append(line.strip("\n"))
print (len(easywords))

2942


In order to "look up" easy words, convert the easy words list into a **dictionary** (e.g., `easy_words_dict`), where each **key** is a word, and each **value** is `True` (that the word is in the list). Make sure to strip off the newline characters so you do not include them in your keys!

In [18]:
easy_words_dict = {}
for word in easywords:
    # Assigning the same key twice is harmless, so no membership check is needed
    easy_words_dict[word] = True
print(easy_words_dict)

{'a': True, 'able': True, 'aboard': True, 'about': True, 'above': True, 'absent': True, 'accept': True, 'accident': True, 'account': True, 'ache': True, 'aching': True, 'acorn': True, 'acre': True, 'across': True, 'act': True, 'acts': True, 'add': True, 'address': True, 'admire': True, 'adventure': True, 'afar': True, 'afraid': True, 'after': True, 'afternoon': True, 'afterward': True, 'afterwards': True, 'again': True, 'against': True, 'age': True, 'aged': True, 'ago': True, 'agree': True, 'ah': True, 'ahead': True, 'aid': True, 'aim': True, 'air': True, 'airfield': True, 'airplane': True, 'airport': True, 'airship': True, 'airy': True, 'alarm': True, 'alike': True, 'alive': True, 'all': True, 'alley': True, 'alligator': True, 'allow': True, 'almost': True, 'alone': True, 'along': True, 'aloud': True, 'already': True, 'also': True, 'always': True, 'am': True, 'america': True, 'american': True, 'among': True, 'amount': True, 'an': True, 'and': True, 'angel': True, 'anger': True, 'angry

Use your `easy_words_dict` to check if the word "information" is in the set of easy words. Use the `get()` method to return a value of `False` if the word is not there (instead of producing a `KeyError`). _You don't need to use a loop to do this!_

In [20]:
easy_words_dict.get('information', False)

False

Additionally, define a dictionary **`readability_grade_dict`** to use for looking up the "grade level" associated with the readability score you eventually compute (see [this table](https://en.wikipedia.org/wiki/Dale%E2%80%93Chall_readability_formula#Formula)). This dictionary should have **keys** that are ___tuples___ giving the range of score for a particular grade (e.g., `(5.0, 5.9)`), and **values** that are ___strings___ representing the grade level of the text (e.g., `"5th or 6th grade"`). 

In [41]:
readability_grade_dict = {
    (0.0, 4.9): "4th grade or lower",
    (5.0, 5.9): "5th or 6th grade",
    (6.0, 6.9): "7th or 8th grade",
    (7.0, 7.9): "9th or 10th grade",
    (8.0, 8.9): "11th or 12th grade",
    (9.0, 9.9): "13th to 15th grade",
}


Define a function **`print_grade()`** that takes in a readability score (a number greater than or equal to 0), and **prints** a string representing the grade associated with that score (from your `readability_grade_dict` dictionary). _Hint:_ you will need to loop through the items in the dictionary and determine which "tuple" key has elements that the score falls between. Be sure to round the score to one decimal place before comparing, to avoid errors with values such as `5.95` that fall between two ranges. Test your function by printing out the "grade level" for a score of 6.4.

In [44]:
def print_grade(score):
    score = round(score, 1)
    for grade, value in readability_grade_dict.items():
        # Bounds are inclusive so that scores such as 7.0 or 7.9 match a range
        if grade[0] <= score <= grade[1]:
            print(value)
print_grade(6.4)

7th or 8th grade
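
A variant that returns the grade instead of printing it can be easier to check, particularly at range boundaries such as 7.0. This is only a sketch: `grade_ranges` and `lookup_grade()` are illustrative names, and returning `None` for scores outside every range is an assumption, not part of the assignment.

```python
# Illustrative copy of the score ranges; names here are not from the assignment.
grade_ranges = {
    (0.0, 4.9): "4th grade or lower",
    (5.0, 5.9): "5th or 6th grade",
    (6.0, 6.9): "7th or 8th grade",
    (7.0, 7.9): "9th or 10th grade",
    (8.0, 8.9): "11th or 12th grade",
    (9.0, 9.9): "13th to 15th grade",
}

def lookup_grade(score):
    """Return the grade label for a score, or None if no range matches."""
    score = round(score, 1)
    for bounds, grade in grade_ranges.items():
        # Both ends of each range are inclusive.
        if bounds[0] <= score <= bounds[1]:
            return grade
    return None  # assumption: scores of 10.0 or more have no label

print(lookup_grade(6.4))
print(lookup_grade(7.0))
```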


Calculating the readability score of a document involves considering the individual words and sentences of that document. However, splitting real-world text documents into words and sentences is non-trivial (English is _hard_!); you need much more than the `split()` method. In order to split up real-world text documents, in this section you will be using the [Natural Language Toolkit (nltk)](http://www.nltk.org/index.html) module. This module is installed along with Anaconda, but does require some additional data source files to be installed on your computer for it to work properly. You should be able to do this by running the cell below (you only need to run it once):

In [45]:
from nltk import download
download('punkt')
download('wordnet')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\yinwy\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\yinwy\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\wordnet.zip.


True

Now to calculate the readability scores! Define a function **`count_sentences()`** that takes as an argument a _string_ of text (which may contain multiple sentences), and counts the number of sentences in that string. The function should **return** that count (a number). Use the [sent_tokenize()](http://www.nltk.org/api/nltk.tokenize.html#nltk.tokenize.sent_tokenize) function from the `nltk.tokenize` module to break up a string into sentences (this is like the string `split()` function, but it splits into sentences rather than dividing by a specific delimiter).
- For help and an example of the `sent_tokenize()` function, see [this guide](http://textminingonline.com/dive-into-nltk-part-ii-sentence-tokenize-and-word-tokenize).
- You should *not* do any extra processing beyond that provided by the `sent_tokenize()` function at this point!
- Test your function on a simple pair or trio of sentences to make sure it's working correctly.

In [52]:
from nltk.tokenize import sent_tokenize

def count_sentences(text):
    # sent_tokenize() returns a list of sentences; its default language is English
    sentences = sent_tokenize(text)
    return len(sentences)

count_sentences("This is a test test. I don't know what to say. This is the end.")

3

The next thing you'll need to do is to count the number of easy words. Define a function **`count_easy_words()`** that takes a _list_ of words as an argument and **returns** the number of words that are "easy".

- Your function should go through each word in the list, and look it up in the `easy_words_dict` you defined earlier (use the `.get()` method!). _Do not look up words in the original easy words list_ (the dictionary is much faster!). Be careful to look up lowercase versions of the word (hint: convert the word to lower case).

- Your function will also handle detecting different parts of speech (e.g., plurals, different verb conjugations, etc.). You can do this by using the **`WordNetLemmatizer()`** function from the `nltk.stem.wordnet` module&mdash;which produces a "lemmatizer" object. You can call the **`lemmatize()`** method on this object to reduce a word to its "base" form. See [this example](https://pythonprogramming.net/lemmatizing-nltk-tutorial/) for details. Note that you should reduce words to both their basic noun AND verb forms (meaning you will need to call the `lemmatize()` function twice: once with `'n'` (noun) and once with `'v'` (verb) as the `pos` argument&mdash;and then check if _either_ the noun stem **or** the verb stem is an "easy word").

- You can test your function on the word list: `['My','words','spoken','have','consequences']`, which should have 4 of the 5 words considered easy (not "consequences").

In [60]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

def count_easy_words(wordlist):
    count = 0
    for word in wordlist:
        word = word.lower()
        word_v = lemmatizer.lemmatize(word,'v')
        word_n = lemmatizer.lemmatize(word,'n')
        if easy_words_dict.get(word_v,False) or easy_words_dict.get(word_n,False):
            count += 1
    return count
count_easy_words(['My','words','spoken','have','consequences'])

4

As with sentences, splitting up natural language into a list of words is tricky because of complex punctuation, contractions, etc. Thus you should use the below `extract_words()` function to "split" the text into a list of words to consider. This will handle punctuation/etc. in a consistent (if not overly robust) way.
- This function uses the [word_tokenize()](http://www.nltk.org/api/nltk.tokenize.html#nltk.tokenize.word_tokenize) function from the `nltk.tokenize` module to break up the string into words. It includes each punctuation character (e.g., commas, periods) as individual "words"; the `extract_words()` function removes these from consideration so you don't need to worry about them.

In [71]:
from nltk.tokenize import word_tokenize
def extract_words(text):
    raw_words = word_tokenize(text)
    words = []
    for word in raw_words:
        if(word[0].isalpha()):
            words.append(word)
    return words


Finally, define a function **`calculate_readability_score()`** that takes in a string of text and returns a readability "score" (a number) for the text based on the [Dale-Chall readability formula](https://en.wikipedia.org/wiki/Dale%E2%80%93Chall_readability_formula#Formula). Call your previous functions to calculate the number of sentences, total words, and number of difficult (not easy) words:
- Start by counting the number of sentences, then by extracting the words and counting the number of easy ones. Follow the formula to weight these values together.
- Don't forget to adjust the score if more than 5% of the words in the text are difficult!

In [85]:
def calculate_readability_score(text):
    sentence_count = count_sentences(text)
    print("Count Sentence = " + str(sentence_count))
    text_word = extract_words(text)
    word_count = len(text_word)
    print('Count Total Words: ' + str(word_count))
    easy_word_count = count_easy_words(text_word)
    diff_word_count = word_count - easy_word_count
    print('Count Difficult words: ' + str(diff_word_count))
    # Raw score: 0.1579 * (percentage of difficult words) + 0.0496 * (average sentence length)
    score = 0.1579 * ((diff_word_count / word_count) * 100) + 0.0496 * word_count / sentence_count
    if diff_word_count / word_count > 0.05:
        score = score + 3.6365  # adjustment when more than 5% of the words are difficult
    return round(score, 1)


Read in the text of the `data/alice.txt` file (the full text of Alice in Wonderland) _as a single string_ (use the `.read()` method). 

In [76]:
with open("data/alice.txt",encoding = 'UTF-8') as Alice:
    alice = Alice.read()

Calculate the readability score for the `alice.txt` file and print it out. Then use your `print_grade()` function to print out the reading grade associated with that score. Use your previously-defined functions!
- For testing, my calculations show `alice.txt` has 977 sentences and 27199 words, of which 3611 are difficult. This leads to a readability score of ~7.113. Note that it's okay if your numbers are slightly off; different operating systems or slightly different approaches can produce different counts because of how `nltk` works.

In [86]:
score = calculate_readability_score(alice)
print(score)
print_grade(score)

Count Sentence = 977
Count Total Words: 27199
Count Difficult words: 3611
7.1
9th or 10th grade
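
The reference counts quoted above (977 sentences, 27199 words, 3611 difficult words) can be run through the formula by hand as a sanity check. This sketch uses only plain arithmetic, so it does not depend on `nltk`; the variable names are illustrative.

```python
# Reference counts for alice.txt quoted in the assignment text.
sentences = 977
words = 27199
difficult_words = 3611

percent_difficult = difficult_words / words * 100
average_sentence_length = words / sentences

# Dale-Chall raw score, plus the adjustment applied when more than
# 5% of the words are difficult.
raw_score = 0.1579 * percent_difficult + 0.0496 * average_sentence_length
if percent_difficult > 5:
    adjusted_score = raw_score + 3.6365
else:
    adjusted_score = raw_score

print(round(adjusted_score, 1))
```

Roughly 13% of the words are difficult, so the adjustment applies and the rounded result agrees with the 7.1 printed by the notebook.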


_Note that this result may not be an especially accurate model of a text's readability&mdash;after all, it's just based on a simple estimation!_