# Exploring data

We will use the MovieLens 100K dataset for our initial survey of user and movie data.

See http://files.grouplens.org/datasets/movielens/ml-100k-README.txt for more information on this dataset.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
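# %matplotlib inline renders plots in the notebook without the namespace-wide
# imports that the discouraged %pylab magic performs
%matplotlib inline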

In [None]:
!rm -f /tmp/ml-100k.zip
!rm -rf /tmp/ml-100k
!wget -O /tmp/ml-100k.zip http://files.grouplens.org/datasets/movielens/ml-100k.zip
!unzip -j -o /tmp/ml-100k.zip -d /tmp/ml-100k

# MovieLens Dataset Info

We can inspect the file u.info to find the overall size of this dataset:

```
u.info     -- The number of users, items, and ratings in the u data set.
```

In [None]:
%cd /tmp/ml-100k
!cat u.info
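
If these totals are needed later as Python values rather than as printed text, a small parser is enough. The following is a minimal sketch that assumes each line of u.info has the form `<count> <label>`; the helper name `parse_info` and the sample lines are illustrative and not part of the dataset tooling.

```python
def parse_info(lines):
    """Turn lines such as '943 users' into {'users': 943}."""
    counts = {}
    for line in lines:
        parts = line.split()
        # Keep only lines that match the assumed "<count> <label>" layout.
        if len(parts) == 2 and parts[0].isdigit():
            counts[parts[1]] = int(parts[0])
    return counts


# Illustrative input using the totals quoted in the dataset README.
# In the notebook, the same helper could be fed open("u.info") instead.
sample_lines = ["943 users", "1682 items", "100000 ratings"]
info = parse_info(sample_lines)
print(info)
```

Lines that do not match the assumed layout are skipped rather than raising an error, which keeps the helper tolerant of blank lines.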

## Movies

Let's start by looking at the movie data, which represents the items that are rated and recommended.

First, we will look at u.genre to understand how these movies are categorized:

In [None]:
!cat u.genre

To find the information about each movie title in our dataset, we check the u.item file:

```
u.item     -- Information about the items (movies); this is a tab separated
              list of
              movie id | movie title | release date | video release date |
              IMDb URL | unknown | Action | Adventure | Animation |
              Children's | Comedy | Crime | Documentary | Drama | Fantasy |
              Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi |
              Thriller | War | Western |
              The last 19 fields are the genres, a 1 indicates the movie
              is of that genre, a 0 indicates it is not; movies can be in
              several genres at once.
              The movie ids are the ones used in the u.data data set.
```

In [None]:
!head -10 u.item

The list of items has several 0/1 columns that one-hot encode the genre information. We'll name those columns in the table header as we load the data into a pandas DataFrame. Note that, despite the README describing the file as tab separated, the fields are delimited by `|`.

In [None]:
# u.item has 24 pipe-separated fields: 5 descriptive columns followed by 19 genre flags.
# Naming all 24 keeps the movie id as a regular column instead of an implicit index.
items = pd.read_csv("u.item", encoding="ISO-8859-1", sep='|', header=None,
    names=['movieid', 'title', 'published', 'video_release_date', 'url',
           'genre_unknown', 'genre_action', 'genre_adventure', 'genre_animation',
           'genre_childrens', 'genre_comedy', 'genre_crime', 'genre_documentary',
           'genre_drama', 'genre_fantasy', 'genre_film-noir', 'genre_horror',
           'genre_musical', 'genre_mystery', 'genre_romance', 'genre_scifi',
           'genre_thriller', 'genre_war', 'genre_western'])
items.head()
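
Because the genres are one-hot encoded, summing a genre column counts the movies carrying that tag, and the flags in a row can be turned back into a list of genre names. The sketch below uses a tiny made-up table (`toy_items`, with placeholder titles and flags) so that it runs without the dataset; applying it to the real table assumes the genre columns share the `genre_` prefix.

```python
import pandas as pd

# Tiny made-up table in the same shape as the movie table: one row per movie,
# one 0/1 column per genre. Titles and flags are illustrative only.
toy_items = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C"],
    "genre_action": [1, 0, 1],
    "genre_comedy": [0, 1, 1],
    "genre_drama": [0, 1, 0],
})

genre_cols = [c for c in toy_items.columns if c.startswith("genre_")]

# Summing a 0/1 column counts the movies tagged with that genre.
genre_counts = toy_items[genre_cols].sum().sort_values(ascending=False)

# Recover the list of genre names for each movie from its one-hot row.
genres_per_movie = [
    [c[len("genre_"):] for c in genre_cols if flags[c] == 1]
    for _, flags in toy_items[genre_cols].iterrows()
]

print(genre_counts)
print(genres_per_movie)
```

Since a movie can belong to several genres at once, the genre counts can add up to more than the number of movies.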

## Users

We have some limited demographics data about users:

```
u.user     -- Demographic information about the users; this is a tab
              separated list of
              user id | age | gender | occupation | zip code
              The user ids are the ones used in the u.data data set.
```
Let's have a look at the first few rows:

In [None]:
!head -10 u.user

Users are described by age, gender, occupation, and postal code. As with u.item, the fields are delimited by `|`:

In [None]:
users = pd.read_csv("u.user", sep='|', header=None, 
    names=['userid', 'age', 'gender', 'occupation', 'postal_code'])
users.head()

In [None]:
users['age'].hist(bins=10)

In [None]:
# gender is a categorical text column, so plot the category counts as bars
users['gender'].value_counts().plot(kind='bar')

*We'll return to user analysis later when we perform user clustering in 02_clustering_users.ipynb*

## Ratings

Let's now look at the ratings data. The u.data file contains individual users' preferences for movies, expressed as ratings from 1 (least liked) to 5 (most liked). Each user has rated only a limited subset of the movies.

```
u.data     -- The full u data set, 100000 ratings by 943 users on 1682 items.
              Each user has rated at least 20 movies.  Users and items are
              numbered consecutively from 1.  The data is randomly
              ordered. This is a tab separated list of
              user id | item id | rating | timestamp.
              The time stamps are unix seconds since 1/1/1970 UTC
```

In [None]:
data = pd.read_csv("u.data", sep='\t', header=None, 
    names=['userid', 'movieid', 'rating', 'timestamp'])
data.head()
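
The timestamp column holds Unix seconds, which are hard to read as raw integers. As a minimal sketch, `pd.to_datetime` with `unit="s"` converts them to calendar dates. The rows in `sample` and the `rated_at` column name are placeholders chosen for illustration.

```python
import pandas as pd

# Illustrative rows in the u.data layout; the values are placeholders.
sample = pd.DataFrame({
    "userid": [1, 2],
    "movieid": [10, 20],
    "rating": [3, 5],
    "timestamp": [881250949, 891717742],
})

# Interpret the integers as seconds since 1970-01-01 UTC.
sample["rated_at"] = pd.to_datetime(sample["timestamp"], unit="s", utc=True)
print(sample[["timestamp", "rated_at"]])
```

The same call on the full ratings table would make it possible to group ratings by month or year.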

In [None]:
# Count distinct ids instead of relying on the ids being consecutive from 1
print("Number of Users: %d" % data['userid'].nunique())
print("Number of Movies: %d" % data['movieid'].nunique())
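
These counts show how sparse the ratings are. Using the totals stated in the README excerpt above (100000 ratings, 943 users, 1682 items), the share of user-movie pairs that actually have a rating can be worked out directly. The variable names below are illustrative.

```python
# Totals as stated in the README excerpt for u.data.
n_users, n_items, n_ratings = 943, 1682, 100000

# Every user could in principle rate every movie.
possible_pairs = n_users * n_items

# Fraction of the user-movie matrix that is actually filled in.
density = n_ratings / possible_pairs
print(f"{density:.2%} of the possible user-movie pairs have a rating")
```

Only a small fraction of the matrix is observed, which is the usual setting for a recommender: most of the entries are the ones to be predicted.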

In [None]:
by_user = data.groupby('userid')
ratings_hist = by_user['movieid'].count().hist(bins=100)
ratings_hist.set_title('Ratings count distribution')

In [None]:
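# Number of ratings submitted by each user
cust_size = by_user.size()
# Peek at one randomly chosen user; the fixed seed makes the pick reproducible
cust_size.sample(random_state=42)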

#print(cust_size)
#cust_size.plot(kind='bar')
#cust_size.count()

In [None]:
data['rating'].describe()

In [None]:
toy_story = data[data.movieid == 1]
toy_story.head()
#toy_story['rating'].hist()

In [None]:
toy_story['rating'].mean()

In [None]:
toy_story['rating'].hist()
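
The single-movie look above generalizes to all movies with a groupby. The following is a minimal sketch on a made-up ratings table (`toy_ratings`, placeholder values) that computes the mean rating and the number of ratings per movie. The output column names `mean_rating` and `n_ratings` are illustrative choices.

```python
import pandas as pd

# Made-up ratings in the same layout as the ratings table (without timestamps).
toy_ratings = pd.DataFrame({
    "userid":  [1, 2, 3, 1, 2],
    "movieid": [1, 1, 1, 2, 2],
    "rating":  [5, 4, 3, 2, 4],
})

# One row per movie: average rating and how many ratings back it up.
movie_stats = (
    toy_ratings.groupby("movieid")["rating"]
    .agg(["mean", "count"])
    .rename(columns={"mean": "mean_rating", "count": "n_ratings"})
)
print(movie_stats)
```

Keeping the count next to the mean matters because an average built from a handful of ratings is far less reliable than one built from hundreds.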