# Data Exploration in Python (without pandas)

In this notebook, we will use Python to parse some datasets.

### Amazon Reviews Dataset

Up until now, we've worked mainly with CSV data that has been easy to import. Often, however, it's not the modeling that is difficult (once we understand the models) but getting the data into a format that we can use for modeling.

Today we'll look at two datasets from Amazon. The first is a collection of movie reviews (the original set of 7 million reviews is also available), and the second is metadata on products, including which customers reviewed each one. Neither is in a ready-to-use format, so we must work on that first.

### Amazon Movie Reviews

- We use a limited version of this data, which is stored in our repo under `/data/amazon/small-movies.txt`
- The full data set is available
<a href="http://snap.stanford.edu/data/web-Movies.html" target="_blank">here</a>.

In [40]:
# Replace the following path with your own local path
path_to_repo = "/Users/ruben/repo/personal/ga/DAT-23-NYC/"
path_to_data = path_to_repo + "data/amazon/"

In [41]:
# Load the dataset into a list of separate lines in python
with open(path_to_data + 'small-movies.txt') as f:
    lines = [line for line in f]

In [42]:
# Print first lines of the datafile
for i in xrange(8):
    print lines[i],  # the trailing comma suppresses the newline that print adds by default

product/productId: B003AI2VGA
review/userId: A141HP4LYPWMSR
review/profileName: Brian E. Erland "Rainbow Sphinx"
review/helpfulness: 7/7
review/score: 3.0
review/time: 1182729600
review/summary: "There Is So Much Darkness Now ~ Come For The Miracle"
review/text: Synopsis: On the daily trek from Juarez, Mexico to El Paso, Texas an ever increasing number of female workers are found raped and murdered in the surrounding desert. Investigative reporter Karina Danes (Minnie Driver) arrives from Los Angeles to pursue the story and angers both the local police and the factory owners who employee the undocumented aliens with her pointed questions and relentless quest for the truth.<br /><br />Her story goes nationwide when a young girl named Mariela (Ana Claudia Talancon) survives a vicious attack and walks out of the desert crediting the Blessed Virgin for her rescue. Her story is further enhanced when the "Wounds of Christ" (stigmata) appear in her palms. She also claims to have received a me

In [43]:
# Print the entire file - your notebook will crash if your file is very large
# print ''.join(lines)

#### Exercise 1

Create a tab-separated file from the file above that contains the following columns: productId, userId, review text, helpfulness score (as a numeric value) and review score (as a numeric value).
Note: What are the issues with helpfulness? How can you resolve them?
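
One issue is that helpfulness is stored as a fraction string such as `7/7`, and a review with no votes would presumably appear as `0/0`, which cannot be divided. The following is a minimal sketch of one way to handle this, assuming the value always has the form `helpful/total`. The function name `parse_helpfulness` and the choice to return `None` when there are no votes are illustrative, not part of the dataset or the exercise.

```python
def parse_helpfulness(value):
    """Convert a 'helpful/total' string such as '7/7' into a float ratio.

    Returns None when there are no votes, so that "no votes" is not
    confused with "0% helpful". (This convention is an assumption.)
    """
    helpful, total = value.strip().split('/')
    helpful, total = int(helpful), int(total)
    if total == 0:
        return None
    # float() avoids integer division under Python 2
    return float(helpful) / total


ratio = parse_helpfulness('7/7')
no_votes = parse_helpfulness('0/0')
```

When writing the TSV, a `None` could be written as an empty field so that it can be read back as a missing value.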

In [44]:
# Hint: use startswith
"Hello, world!".startswith("Hello")

True
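
Building on the hint, the sketch below shows one possible way to group the `key: value` lines into one dictionary per review. It assumes that every review begins with a `product/productId:` line, as in the sample output above, and that every field sits on a single line. The function name `group_reviews` is hypothetical, and the second record in the sample is made up for illustration.

```python
def group_reviews(lines):
    """Group 'key: value' lines into one dict per review.

    Assumes each review starts with a 'product/productId:' line and that
    each field occupies exactly one line.
    """
    reviews = []
    current = None
    for line in lines:
        line = line.strip()
        # Skip blank separator lines and anything that is not a known field
        if not line.startswith(('product/', 'review/')):
            continue
        # Split on the first colon only; values may contain colons themselves
        key, _, value = line.partition(':')
        if key == 'product/productId':
            current = {}
            reviews.append(current)
        if current is not None:
            current[key] = value.strip()
    return reviews


review_sample = [
    'product/productId: B003AI2VGA\n',
    'review/userId: A141HP4LYPWMSR\n',
    'review/helpfulness: 7/7\n',
    'review/score: 3.0\n',
    '\n',
    'product/productId: EXAMPLE_PRODUCT\n',  # made-up record
    'review/userId: EXAMPLE_USER\n',
    'review/score: 5.0\n',
]
grouped = group_reviews(review_sample)
```

Each dictionary can then be turned into one output row, for example by looking up `grouped[0]['review/score']` and converting it with `float`.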

In [45]:
# Put your code here
reviews = []

pass  # your code

for line in lines:
    pass  # your code


In [46]:
rows = []

for review in reviews:
    pass  # your code

In [47]:
rows = ['1\tqwqw' for i in xrange(3)]

In [48]:
output_file = path_to_data + 'small-movies-results.csv'
with open(output_file, 'w') as f:
    for row in rows:
        f.write(row + '\n')
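
Review text may itself contain tabs or line breaks, which would be read back as extra columns or rows. As a hedged sketch, a small helper (the name `to_tsv_row` is hypothetical) could replace those characters with spaces before joining the fields:

```python
def to_tsv_row(fields):
    """Join fields with tabs after removing characters that would break TSV."""
    cleaned = []
    for field in fields:
        text = '' if field is None else str(field)
        # Tabs and line breaks inside a field would be read as separators
        for bad in ('\t', '\r', '\n'):
            text = text.replace(bad, ' ')
        cleaned.append(text)
    return '\t'.join(cleaned)


example_row = to_tsv_row(['B003AI2VGA', 1.0, 'two\tparts\n', None])
```

If a field contains double quotes, pandas may interpret them as quoting characters when reading the file; passing `quoting=csv.QUOTE_NONE` to `pd.read_csv` is one way to turn that off.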

Let's check that you have a properly formatted TSV by parsing it back in with pandas.

In [50]:
import pandas as pd
df = pd.read_csv(output_file, sep='\t', header=None)
df.head()

|   | 0 | 1    |
|---|---|------|
| 0 | 1 | qwqw |
| 1 | 1 | qwqw |
| 2 | 1 | qwqw |


### Amazon Metadata

- We use a limited version of this data, which is stored in our repo under `/data/amazon/small-amazon-meta.txt`
- The full data set is available
<a href="http://snap.stanford.edu/data/amazon-meta.html" target="_blank">here</a>.

The task here is to parse this file into a collection of records, each holding a product id (ASIN), a title, and the list of customers (by id) who have reviewed the product.

```
productID, title, [customer1, customer2]
```
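
The file layout is not shown in this notebook, so the sketch below rests on assumptions that should be checked against `small-amazon-meta.txt`: each product starts with an `ASIN:` line, has an indented `title:` line, and lists one review per line with the customer id after a `cutomer:` label (the regular expression also accepts `customer:`). The function name `parse_products` and all sample values are made up for illustration.

```python
import re

# Assumed label for the customer id on a review line; verify against the file
CUSTOMER_RE = re.compile(r'cus?tomer:\s*(\S+)')


def parse_products(lines):
    """Return a list of (asin, title, [customer ids]) tuples."""
    products = []
    current = None
    for line in lines:
        line = line.strip()
        if line.startswith('ASIN:'):
            current = {'asin': line[len('ASIN:'):].strip(),
                       'title': None,
                       'customers': []}
            products.append(current)
        elif current is None:
            continue  # ignore anything before the first product
        elif line.startswith('title:'):
            current['title'] = line[len('title:'):].strip()
        else:
            match = CUSTOMER_RE.search(line)
            if match:
                current['customers'].append(match.group(1))
    return [(p['asin'], p['title'], p['customers']) for p in products]


# Made-up records in the assumed layout
meta_sample = [
    'Id:   1\n',
    'ASIN: EXAMPLE0001\n',
    '  title: An Example Title\n',
    '  reviews: total: 2  downloaded: 2  avg rating: 5\n',
    '    2000-7-28  cutomer: CUSTOMER_A  rating: 5  votes:  10  helpful:   9\n',
    '    2003-12-14  cutomer: CUSTOMER_B  rating: 4  votes:   6  helpful:   5\n',
    '\n',
    'Id:   2\n',
    'ASIN: EXAMPLE0002\n',
    '  title: Another Title\n',
]
parsed_products = parse_products(meta_sample)
```

A product without a `title:` line keeps `None` as its title, which makes such records easy to filter out later.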

#### Exercise 2

- (**) If you'd like, create a review class that holds the customer id and star rating. Use this to output a product id, title and list of reviews.
- (*) Create a product class that holds the id, title and collection of reviews.
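
As a starting point, the two classes could look roughly like the sketch below. The attribute and method names (`customer_id`, `rating`, `add_review`, `average_rating`) are illustrative choices rather than requirements of the exercise, and the sample values are made up.

```python
class Review(object):
    """One customer's review of a product."""

    def __init__(self, customer_id, rating):
        self.customer_id = customer_id
        self.rating = rating

    def __repr__(self):
        return 'Review(%r, %r)' % (self.customer_id, self.rating)


class Product(object):
    """A product with its id (ASIN), title and collected reviews."""

    def __init__(self, product_id, title):
        self.product_id = product_id
        self.title = title
        self.reviews = []

    def add_review(self, customer_id, rating):
        self.reviews.append(Review(customer_id, rating))

    def average_rating(self):
        """Mean star rating, or None if the product has no reviews."""
        if not self.reviews:
            return None
        return sum(r.rating for r in self.reviews) / float(len(self.reviews))


example_product = Product('EXAMPLE0001', 'An Example Title')  # made-up values
example_product.add_review('CUSTOMER_A', 5)
example_product.add_review('CUSTOMER_B', 4)
```

Deriving from `object` keeps the classes new-style under Python 2 and is harmless under Python 3.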

### Lahman Baseball Dataset

Available here: 
<a href="http://seanlahman.com/files/database/lahman-csv_2014-02-14.zip">Lahman Baseball Dataset</a>.

- Without using Pandas, read in Salaries.csv and output average salary by playerID.
- (**) Without using Pandas, read in Salaries.csv and Master.csv and output salary by nameFirst and nameLast.
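
Both tasks can be approached with the standard `csv` module. The sketch below assumes that Salaries.csv has `playerID` and `salary` columns and that Master.csv has `playerID`, `nameFirst` and `nameLast` columns; check the header rows of the real files. The function names and all sample rows are made up for illustration.

```python
import csv
from collections import defaultdict


def average_salary_by_player(salary_lines):
    """Average the 'salary' column per 'playerID' (column names assumed)."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in csv.DictReader(salary_lines):
        totals[row['playerID']] += float(row['salary'])
        counts[row['playerID']] += 1
    return dict((pid, totals[pid] / counts[pid]) for pid in totals)


def names_by_player(master_lines):
    """Map playerID to a (nameFirst, nameLast) tuple."""
    return dict((row['playerID'], (row['nameFirst'], row['nameLast']))
                for row in csv.DictReader(master_lines))


# Made-up rows; the real files have more rows and columns
salary_lines = [
    'yearID,teamID,lgID,playerID,salary',
    '2000,AAA,XX,player01,100',
    '2001,AAA,XX,player01,300',
    '2000,BBB,XX,player02,50',
]
master_lines = [
    'playerID,nameFirst,nameLast',
    'player01,Ann,Example',
    'player02,Bob,Sample',
]

averages = average_salary_by_player(salary_lines)
names = names_by_player(master_lines)
# Join the two lookups on playerID, skipping players missing from Master.csv
salary_by_name = dict((names[pid], avg)
                      for pid, avg in averages.items() if pid in names)
```

With the real files, an open file object can be passed in place of the sample lists, for example `with open('Salaries.csv') as f: averages = average_salary_by_player(f)`.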