# Homework 1 - Data Structures and Sorting

## Due: Friday, Sep 21, 2018 at 11:59:00pm

### Name:  Samantha Cohen 
### Uniqname: samcoh
### People you worked with: Rhea

### Submission instructions
After completing this homework, you will turn in two files via Canvas ->  Assignments -> Homework 1:

- Your Notebook, named ```si330-hw1-YOUR_UNIQUE_NAME.ipynb```
- The HTML file, named ```si330-hw1-YOUR_UNIQUE_NAME.html```

## Objectives
After completing this homework assignment, you should know how to
* use compound data structures
* perform simple and complex sorting
* use lambda functions

In addition, this assignment will provide an opportunity to work with a large (100,000 row) data set.

## Background

Massive Open Online Courses (MOOCs) are a popular way for people to learn new skills.  The University of Michigan
offers many different MOOCs, which are produced by faculty members and supported by the Office of Academic 
Innovation.

MOOCs tend to be used by hundreds to hundreds of thousands of users.  These users leave "digital exhaust" in the
form of web server log entries as they work through the MOOC.  We have obtained a small sample of these data
files from Prof. Chris Brooks, who is a colleague here at UMSI.  The data files are de-identified: anything
that could identify a person, such as their UMID or their IP address, is "hashed" (replaced by an irreversible digest).  Each line in the
data file represents a "page view" by a user.  The schema for each line is:

```umich_user_id, hashed_session_cookie_id, server_timestamp, hashed_ip, user_agent, url, initial_referrer_url, browser_language, course_id, country_cd, region_cd, timezone, os, browser, key, value```

Of note is the ```umich_user_id```, which identifies each user, and ```hashed_session_cookie_id``` which identifies a session.
Sessions are important:  they represent a collection of pageviews between the time that a user logs in and,
usually, when they log out.
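
To make the schema concrete, here is a minimal sketch of how `csv.DictReader` turns each line into a dictionary keyed by column name. The two rows are made-up values, and only four of the schema's columns are included to keep the example short.

```python
import csv
import io

# Made-up sample with only a few of the schema's columns (illustration only).
sample = io.StringIO(
    "umich_user_id,hashed_session_cookie_id,country_cd,url\n"
    "u1,s1,US,/lecture/1\n"
    "u1,s1,US,/lecture/2\n"
)

rows = list(csv.DictReader(sample))

# Each row is a dict keyed by column name, so one page view looks like this:
first = rows[0]
print(first["hashed_session_cookie_id"], first["country_cd"])
```

Both sample rows share the same `hashed_session_cookie_id`, so they would count as two page views in one session.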

In the lab, we went through the motions of some manipulation of the MOOC log data.  For this assignment, you'll try to answer several real-world questions:

1. How many different countries (based on ```country_cd```) are represented in the data file?
2. What are the top 5 countries by number of page views?
3. For people accessing the MOOC from the US, what is the average number of page views per session?
4. What are the top 5 sessions in terms of their number of logs?

In addition to the MOOC data file, you're also going to use a file called ```countrycodes.tsv``` to map
2-letter country codes to the full name of the country.  Why?  Because not everyone knows that PF is 
"French Polynesia".

The rest of the notebook contains specific steps that you need to follow and complete.

First, let's load up the ```csv``` library; we're going to need it to read the comma- and tab-separated 
values files.

In [2]:
import csv
from collections import defaultdict

## Part 1. Import the data

You'll load the data from the two files ```mooc_small.csv``` and ```countrycodes.tsv``` into two separate 
data structures. 

### Part 1.1 Country codes

Let's start with ```countrycodes.tsv```.  Remember, we're going to use that file to map from the 
2-letter country code to the country name (e.g. from "CA" to "Canada").  




 <font color="magenta">Modify the next block of code so that it loads ```countrycodes.tsv``` into a data structure
    that would allow you to efficiently look up the country name that corresponds to the 2-digit country code.</font>

In [3]:
country_names = {}  # maps 2-letter country code -> country name

with open("countrycodes.tsv", "r") as csvfile:
    reader = csv.DictReader(csvfile, delimiter = "\t", quotechar = '"')
    for row in reader:
        # Key each country name by its ISO alpha-2 code for constant-time lookup
        country_names[row["ISO ALPHA-2 Code"]] = row["Country or Area Name"]

### Part 1.2 MOOC data

Now load the MOOC log data into an appropriate data structure (start with the mooc_small.csv file, then remember to change to mooc_big.csv). For this file, you should store all the rows in a data structure.


 <font color="magenta">Modify the next block of code so that it loads the MOOC log data into a data structure 
   that will allow you to answer the four real-world questions.</font>

In [4]:
mooc_data_file_name = "mooc-small.csv"

mooc_data = []  # one dict per page view, in file order

with open(mooc_data_file_name, "r") as csvfile:
    reader = csv.DictReader(csvfile, delimiter = ",", quotechar = '"')
    for row in reader:
        # Keep every row; later steps group and count them
        mooc_data.append(row)

## Part 2. Manipulating and interpreting the data to answer our questions

### Part 2.1: Unique countries

Now that we have our data loaded, we can start to answer the real-world questions.

Recall that the first question
is **"How many different countries (based on `country_cd`) are represented in the data file?"**

To do this, you're going to have to figure out how many unique country codes there are in the MOOC log file. There are a few different ways to do this, but you probably want to use either a ```set``` or a ```dict```.

<font color="magenta">Modify the following code block so that the print statement at the end prints
    the number of countries represented in the MOOC log file.</font>

In [5]:
countries = {}  # maps country code -> list of rows from that country

for row in mooc_data:
    # Create the list the first time a country code is seen
    if row["country_cd"] not in countries:
        countries[row["country_cd"]] = []
    countries[row["country_cd"]].append(row)
# Do not change the following line
print("There are {0} unique countries in the MOOC log data file.".format(len(countries)))

There are 19 unique countries in the MOOC log data file.
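
The cell above uses a ```dict```; the text also mentions a ```set``` as an option. As a minimal sketch of that alternative, using made-up toy rows in place of `mooc_data`, a set comprehension gives the same count because a set stores each value only once:

```python
# Toy rows standing in for mooc_data (made-up values).
toy_rows = [
    {"country_cd": "US"},
    {"country_cd": "CA"},
    {"country_cd": "US"},
    {"country_cd": "PF"},
]

# A set keeps each country code once, however many rows mention it.
unique_codes = {row["country_cd"] for row in toy_rows}
print(len(unique_codes))
```

The set is lighter when only the count is needed; the dictionary is the better fit when the rows for each country are needed later.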


### Part 2.2: Top 5 countries

Next you want to find out the <b>top 5 countries with the most page views</b> (each row in the MOOC log file counts as one page view). There are multiple ways to do this, but here you need to implement a composite data structure (a *dictionary of lists*) that stores, for each country, metadata from each log entry (specifically, the ```hashed_session_cookie_id```). This data structure will be used to answer the 3rd question later. Think about how you would populate this data structure.

After that you will sort the data structure using the ```sorted()``` function. You will need to write down the code to provide the ```sorted()``` function with a `key` parameter using ```lambda```. This will specify what the data structure will be sorted by.
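
As an illustration of the `key` parameter, here is a small sketch on a toy dictionary of lists (the country codes and session ids are made up). It is one way to rank dictionary entries by the length of their lists, not the only one.

```python
# Toy dictionary of lists: country code -> session ids seen (made-up values).
toy_country_sessions = {
    "CA": ["s4"],
    "US": ["s1", "s1", "s2"],
    "DE": ["s3", "s5"],
}

# .items() yields (key, value) pairs; the lambda ranks each pair by the
# length of its list, and reverse=True puts the longest list first.
ranked = sorted(
    toy_country_sessions.items(),
    key=lambda pair: len(pair[1]),
    reverse=True,
)

for code, sessions in ranked:
    print(code, len(sessions))
```

Sorting `.items()` rather than the dictionary itself keeps each list attached to its key, so the counts are still available after sorting.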

<font color="magenta">Modify the following code block so that the print statement at the end prints
    the top 5 countries represented in the MOOC log file, and the corresponding number of page views.</font>

In [10]:
country_user_data = {}  # maps country code -> list of session ids, one per page view

for row in mooc_data:
    if row["country_cd"] not in country_user_data:
        country_user_data[row["country_cd"]] = []
    country_user_data[row["country_cd"]].append(row["hashed_session_cookie_id"])

# Sort the (country code, session list) pairs by list length, largest first
sorted_country_user_data = sorted(country_user_data.items(),key = lambda x: len(x[1]), reverse = True)

# Alternative that does NOT work here: sorting the dictionary itself yields only
# the keys (country codes), so the print loop below could not read the counts.
#sorted_country_user_data = sorted(country_user_data, key = lambda x: len(country_user_data[x]), reverse = True) 
#print(sorted_country_user_data)

#print statements to check work: 

#print(sorted_country_user_data)
#print(sorted(country_user_data.items(), key = lambda x: len(x[1]), reverse = True) )

#for x in sorted_country_user_data: 
    #print(country_names[x[0]], ":",len(x[1]))

    
# Do not change the following lines of code. 
# This should output the top 5 countries, along with the number of users from each of those countries.
#print(sorted_country_user_data)

for i in range(5):
    print(country_names[sorted_country_user_data[i][0]], ':', len(sorted_country_user_data[i][1]))


    

United States of America : 44
Canada : 10
Germany : 8
French Polynesia : 5
Belarus : 5


From this step on, you will be working with the ```country_user_data``` data structure.

### Part 2.3 Filter to US data

Here, you will need to <b>filter the data so you only have entries from the US (i.e. where ```country_cd``` is ```US```)</b>. Then you can retrieve the number of logs for each session, i.e. the number of rows that share the same ```hashed_session_cookie_id```.

From the ```country_user_data``` data structure, you can retrieve the entries from the US. Then, you will create a new data structure called `us_data` using a ```defaultdict```, 
which you will use to count the number of logs (number of rows) in each unique session from the US (sessions are uniquely identified by ```hashed_session_cookie_id```).
The number of logs will give you the number of pages a user has viewed in one session.
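
The counting pattern described above can be sketched on a short, made-up list of session ids. The names `toy_us_sessions` and `toy_counts` are illustrative only.

```python
from collections import defaultdict

# Made-up session ids, one entry per page view.
toy_us_sessions = ["s1", "s2", "s1", "s3", "s1"]

# defaultdict(int) supplies 0 for a missing key, so no "if key not in" check is needed.
toy_counts = defaultdict(int)
for session_id in toy_us_sessions:
    toy_counts[session_id] += 1

print(dict(toy_counts))
```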

Modify the following code block so the data structure `us_data` contains counts of the number of logs for each session from the US.

In [6]:
us_data = defaultdict(int)  # maps session id -> number of log entries in that session
for session_id in country_user_data['US']:
    # Each list element is one page view, so add 1 to that session's count
    us_data[session_id] += 1

### Part 2.4 Average number of pageviews per session

Now, you need to calculate the <b>average number of pageviews per session</b> for users in the US. While the ```numpy``` package, which will be covered later in the semester, has a built-in method for calculating means, for now you will do this manually. You will iterate over the values, sum them up, and divide by the number of values. Use the ```sum``` and ```len``` functions.

<font color="magenta">In the following block of code put in the formula for calculating the average.</font>

In [7]:
avg_page_views_per_session = sum(us_data.values()) / len(us_data)  # total page views / number of sessions
print(avg_page_views_per_session)

1.76
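
As a cross-check of the manual formula, the sketch below compares it with `statistics.mean` from the standard library on made-up counts. The assignment does not ask for `statistics`; it is used here only to confirm that sum divided by length gives the mean.

```python
import math
import statistics

# Made-up per-session page-view counts.
toy_counts = {"s1": 3, "s2": 1, "s3": 2}

# Manual mean: total page views divided by the number of sessions.
manual_mean = sum(toy_counts.values()) / len(toy_counts)

# Standard-library cross-check (not required by the assignment).
library_mean = statistics.mean(list(toy_counts.values()))

print(manual_mean, math.isclose(manual_mean, library_mean))
```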


### Part 2.5 Find top 5 sessions

Finally, you want to <b>sort the sessions to retrieve the ones that have the most logs</b>. Call the ```sorted``` function, pass the appropriate ```lambda``` function to the ```key``` parameter, and store the result in ```sorted_us_data```.

In the following block, write down the code for the `sorted` function. The `print` statement should output the top 5 `hashed_session_cookie_id` and the corresponding number of logs for that session.

In [8]:
# Sort (session id, count) pairs by count, largest first
sorted_us_data = sorted(us_data.items(), key= lambda t: t[1], reverse = True)

for i in range(5):
    print(sorted_us_data[i]) # prints the top 5 sessions as (hashed_session_cookie_id, number of log entries)

('d8fe83dbeba4af9b001d3ad8f8aa8940b40e06ce', 6)
('9431b24e18b18ea6b5aea81920abd33fb9ce55ee', 4)
('c13cb2cdb6e7ebbc4e1e434a29e449e221f3c5d3', 3)
('e0f1598cc697187a9ab35f12562f7ad7ce2dcc2a', 3)
('85bf1f93b06602d828147c5b2ffabb066e63c4b1', 3)


## Part 3 BONUS (up to 10 points)

For BONUS points, re-write all the code from above (including any code we provided) without using *any* loops, **only** comprehensions.

Leave your original code above, and **create new blocks below** containing your BONUS code for each section from Part 1 and Part 2.
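
To illustrate what replacing a loop with a comprehension can look like, here is a sketch on made-up rows that builds the same dictionary of lists both ways. The variable names are hypothetical, and the nested comprehension is only one possible approach.

```python
# Toy rows (made-up values).
toy_rows = [
    {"country_cd": "US", "hashed_session_cookie_id": "s1"},
    {"country_cd": "CA", "hashed_session_cookie_id": "s2"},
    {"country_cd": "US", "hashed_session_cookie_id": "s3"},
]

# Loop version: build a dictionary of lists step by step.
by_country_loop = {}
for row in toy_rows:
    by_country_loop.setdefault(row["country_cd"], []).append(
        row["hashed_session_cookie_id"]
    )

# Comprehension version: an outer dict comprehension over the unique codes,
# with an inner list comprehension that filters the rows for each code.
by_country_comp = {
    code: [r["hashed_session_cookie_id"] for r in toy_rows if r["country_cd"] == code]
    for code in {r["country_cd"] for r in toy_rows}
}

print(by_country_comp == by_country_loop)
```

The comprehension version rescans all rows once per country code, so it does more work than the loop; that trade-off is usually acceptable for small inputs.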

In [9]:
## Write your BONUS code here

First, let's load up the ```csv``` library; we're going to need it to read the comma- and tab-separated 
values files.

In [10]:
import csv
from collections import defaultdict

## Part 1. Import the data (BONUS)

You'll load the data from the two files ```mooc_small.csv``` and ```countrycodes.tsv``` into two separate 
data structures. 

### Part 1.1 Country codes (BONUS) 

Let's start with ```countrycodes.tsv```.  Remember, we're going to use that file to map from the 
2-letter country code to the country name (e.g. from "CA" to "Canada").  



In [11]:
with open("countrycodes.tsv", "r") as csvfile:
    reader = csv.DictReader(csvfile, delimiter = "\t", quotechar = '"')
    # Dict comprehension: 2-letter country code -> country name
    country_names = {row["ISO ALPHA-2 Code"]: row["Country or Area Name"] for row in reader}
    

### Part 1.2 MOOC data (BONUS)

Now load the MOOC log data into an appropriate data structure (start with the mooc_small.csv file, then remember to change to mooc_big.csv). For this file, you should store all the rows in a data structure.


 <font color="magenta">Modify the next block of code so that it loads the MOOC log data into a data structure 
   that will allow you to answer the four real-world questions.</font>

In [12]:
mooc_data_file_name = "mooc-small.csv"

with open(mooc_data_file_name, "r") as csvfile:
    reader = csv.DictReader(csvfile, delimiter = ",", quotechar = '"')
    # List comprehension: one dict per page view, in file order
    mooc_data = [row for row in reader]


## Part 2. Manipulating and interpreting the data to answer our questions (BONUS)

### Part 2.1: Unique countries (BONUS)

Now that we have our data loaded, we can start to answer the real-world questions.

Recall that the first question
is **"How many different countries (based on `country_cd`) are represented in the data file?"**

To do this, you're going to have to figure out how many unique country codes there are in the MOOC log file. There are a few different ways to do this, but you probably want to use either a ```set``` or a ```dict```.

<font color="magenta">Modify the following code block so that the print statement at the end prints
    the number of countries represented in the MOOC log file.</font>

In [13]:
# Dict comprehension creates one empty list per distinct country code
countries = {row["country_cd"]: [] for row in mooc_data}
# Comprehension used only for its side effect (appending), to satisfy the no-loops rule
[countries[row["country_cd"]].append(row) for row in mooc_data]
# Do not change the following line
print("There are {0} unique countries in the MOOC log data file.".format(len(countries)))

There are 19 unique countries in the MOOC log data file.


### Part 2.2: Top 5 countries (BONUS)

Next you want to find out the <b>top 5 countries with the most page views</b> (each row in the MOOC log file counts as one page view). There are multiple ways to do this, but here you need to implement a composite data structure (a *dictionary of lists*) that stores, for each country, metadata from each log entry (specifically, the ```hashed_session_cookie_id```). This data structure will be used to answer the 3rd question later. Think about how you would populate this data structure.

After that you will sort the data structure using the ```sorted()``` function. You will need to write down the code to provide the ```sorted()``` function with a `key` parameter using ```lambda```. This will specify what the data structure will be sorted by.

<font color="magenta">Modify the following code block so that the print statement at the end prints
    the top 5 countries represented in the MOOC log file, and the corresponding number of page views.</font>

In [14]:
# One empty list per distinct country code
country_user_data = {row["country_cd"]: [] for row in mooc_data}
# Comprehension used only for its side effect: append each row's session id
[country_user_data[row["country_cd"]].append(row["hashed_session_cookie_id"]) for row in mooc_data]

# Sort the (country code, session list) pairs by list length, largest first
sorted_country_user_data = sorted(country_user_data.items(),key = lambda x: len(x[1]), reverse = True)
# print() returns None, so the resulting list is discarded; only the printed lines matter
empty_ls=[print(country_names[sorted_country_user_data[i][0]], ':', len(sorted_country_user_data[i][1])) for i in range(5)]


United States of America : 44
Canada : 10
Germany : 8
French Polynesia : 5
Belarus : 5


### Part 2.3 Filter to US data (BONUS)

Here, you will need to <b>filter the data so you only have entries from the US (i.e. where ```country_cd``` is ```US```)</b>. Then you can retrieve the number of logs for each session, i.e. the number of rows that share the same ```hashed_session_cookie_id```.

From the ```country_user_data``` data structure, you can retrieve the entries from the US. Then, you will create a new data structure called `us_data` using a ```defaultdict```, 
which you will use to count the number of logs (number of rows) in each unique session from the US (sessions are uniquely identified by ```hashed_session_cookie_id```).
The number of logs will give you the number of pages a user has viewed in one session.

Modify the following code block so the data structure `us_data` contains counts of the number of logs for each session from the US.

In [15]:
# For each session id in the US list, count how often it occurs; repeated ids overwrite the same key
us_data = defaultdict(int, {session_id: country_user_data['US'].count(session_id) for session_id in country_user_data['US']})
#print(us_data)
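
The comprehension above rescans the whole US list for every element, which can become slow on the larger data file. As a hedged alternative, and assuming the no-loops rule permits other standard-library helpers, `collections.Counter` tallies the values in one pass. The session ids below are made up.

```python
from collections import Counter

# Made-up session ids, one entry per page view.
toy_us_sessions = ["s1", "s2", "s1", "s3", "s1"]

# Counter tallies each distinct value in a single pass over the list.
toy_counts = Counter(toy_us_sessions)

# most_common(n) returns the n highest counts as (value, count) pairs.
print(toy_counts.most_common(1))
```

A `Counter` is a `dict` subclass, so `sum(toy_counts.values())` and `len(toy_counts)` work on it just as they do on a `defaultdict`.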


### Part 2.4 Average number of pageviews per session (BONUS)

Now, you need to calculate the <b>average number of pageviews per session</b> for users in the US. While the ```numpy``` package, which will be covered later in the semester, has a built-in method for calculating means, for now you will do this manually. You will iterate over the values, sum them up, and divide by the number of values. Use the ```sum``` and ```len``` functions.

<font color="magenta">In the following block of code put in the formula for calculating the average.</font>

In [16]:
avg_page_views_per_session = sum(us_data.values()) / len(us_data)  # total page views / number of sessions
print(avg_page_views_per_session)

1.76


### Part 2.5 Find top 5 sessions (BONUS)

Finally, you want to <b>sort the sessions to retrieve the ones that have the most logs</b>. Call the ```sorted``` function, pass the appropriate ```lambda``` function to the ```key``` parameter, and store the result in ```sorted_us_data```.

In the following block, write down the code for the `sorted` function. The `print` statement should output the top 5 `hashed_session_cookie_id` and the corresponding number of logs for that session.

In [17]:
# Sort (session id, count) pairs by count, largest first
sorted_us_data = sorted(us_data.items(), key= lambda t: t[1], reverse = True)
# print() returns None, so the list is discarded; this prints the top 5 (session id, count) pairs
empty_l= [print(sorted_us_data[i]) for i in range(5)]

('d8fe83dbeba4af9b001d3ad8f8aa8940b40e06ce', 6)
('9431b24e18b18ea6b5aea81920abd33fb9ce55ee', 4)
('c13cb2cdb6e7ebbc4e1e434a29e449e221f3c5d3', 3)
('e0f1598cc697187a9ab35f12562f7ad7ce2dcc2a', 3)
('85bf1f93b06602d828147c5b2ffabb066e63c4b1', 3)
