In [2]:
import pandas as pd
import numpy as np

# Read data
These files contain 1,000,209 anonymous ratings of approximately 3,900 movies made by 6,040 MovieLens users who joined MovieLens in 2000.

## Reading users file
User information is in the file "users.dat" and is in the following format:

`UserID::Gender::Age::Occupation::Zip-code`

All demographic information is provided voluntarily by the users and is not checked for accuracy. 
Only users who have provided some demographic information are included in this data set.

- Gender is denoted by a "M" for male and "F" for female
- Age is chosen from the following ranges:
    - 1:  "Under 18"
    - 18:  "18-24"
    - 25:  "25-34"
    - 35:  "35-44"
    - 45:  "45-49"
    - 50:  "50-55"
    - 56:  "56+"

- Occupation is chosen from the following choices:
    - 0:  "other" or not specified
    - 1:  "academic/educator"
    - 2:  "artist"
    - 3:  "clerical/admin"
    - 4:  "college/grad student"
    - 5:  "customer service"
    - 6:  "doctor/health care"
    - 7:  "executive/managerial"
    - 8:  "farmer"
    - 9:  "homemaker"
    - 10:  "K-12 student"
    - 11:  "lawyer"
    - 12:  "programmer"
    - 13:  "retired"
    - 14:  "sales/marketing"
    - 15:  "scientist"
    - 16:  "self-employed"
    - 17:  "technician/engineer"
    - 18:  "tradesman/craftsman"
    - 19:  "unemployed"
    - 20:  "writer"
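
The code tables above translate directly into lookup dictionaries, so the numeric codes can be replaced by readable labels. This is a minimal sketch rather than part of the original notebook: `AGE_LABELS`, `OCCUPATION_LABELS`, and `sample_users` are illustrative names, and the three sample rows are typed by hand in the documented format.

```python
import pandas as pd

# Lookup tables copied from the code lists above.
AGE_LABELS = {
    1: "Under 18", 18: "18-24", 25: "25-34", 35: "35-44",
    45: "45-49", 50: "50-55", 56: "56+",
}
OCCUPATION_LABELS = {
    0: "other or not specified", 1: "academic/educator", 2: "artist",
    3: "clerical/admin", 4: "college/grad student", 5: "customer service",
    6: "doctor/health care", 7: "executive/managerial", 8: "farmer",
    9: "homemaker", 10: "K-12 student", 11: "lawyer", 12: "programmer",
    13: "retired", 14: "sales/marketing", 15: "scientist",
    16: "self-employed", 17: "technician/engineer",
    18: "tradesman/craftsman", 19: "unemployed", 20: "writer",
}

# Hand-typed stand-in for the users table (illustrative only).
sample_users = pd.DataFrame({
    "user_id": [1, 2, 3],
    "sex": ["F", "M", "M"],
    "age": [1, 56, 25],
    "occupation": [10, 16, 15],
})

# Series.map replaces each code with its label; unknown codes become NaN.
sample_users["age_group"] = sample_users["age"].map(AGE_LABELS)
sample_users["occupation_name"] = sample_users["occupation"].map(OCCUPATION_LABELS)
print(sample_users)
```

Keeping the original code columns next to the label columns makes it easy to sort or group by the numeric code while still displaying the readable name.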

In [12]:
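# Column order follows the file format: UserID::Gender::Age::Occupation::Zip-code
u_cols = ['user_id', 'sex', 'age', 'occupation', 'zip_code']
# '::' is a multi-character separator, so request the Python parsing engine explicitly
df_users = pd.read_csv('data/users.dat', sep='::', names=u_cols, engine='python', encoding='latin-1')

df_users.head()


Unnamed: 0,user_id,sex,age,occupation,zip_code
0,1,F,1,10,48067
1,2,M,56,16,70072
2,3,M,25,15,55117
3,4,M,45,7,2460
4,5,M,25,20,55455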
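
The zip code in row 3 above prints as `2460`, which suggests the column was parsed as an integer and a leading zero was dropped. The sketch below shows one way to keep zip codes as text. It reads a few hand-typed lines from memory instead of the real file, and the stored form `02460` is an assumption about the raw data; `raw`, `cols`, and `users` are illustrative names.

```python
import io
import pandas as pd

# Hypothetical raw lines in UserID::Gender::Age::Occupation::Zip-code format.
# The five-digit value "02460" is an assumption about the raw file.
raw = (
    "1::F::1::10::48067\n"
    "2::M::56::16::70072\n"
    "4::M::45::7::02460\n"
)

cols = ["user_id", "sex", "age", "occupation", "zip_code"]

users = pd.read_csv(
    io.StringIO(raw),
    sep="::",                 # multi-character separator
    names=cols,
    engine="python",          # the C engine does not support such separators
    dtype={"zip_code": str},  # keep zip codes as text so leading zeros survive
)
print(users)
```

Treating zip codes as strings also avoids errors if some entries are not purely numeric.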


## Reading items file
Movie information is in the file "movies.dat" and is in the following format:

`MovieID::Title::Genres`

- Titles are identical to titles provided by the IMDB (including year of release)
- Genres are pipe-separated and are selected from the following genres:

    - Action
    - Adventure
    - Animation
    - Children's
    - Comedy
    - Crime
    - Documentary
    - Drama
    - Fantasy
    - Film-Noir
    - Horror
    - Musical
    - Mystery
    - Romance
    - Sci-Fi
    - Thriller
    - War
    - Western

- Some MovieIDs do not correspond to a movie due to accidental duplicate
entries and/or test entries
- Movies are mostly entered by hand, so errors and inconsistencies may exist


In [14]:
i_cols = ['movie_id', 'title', 'genres']
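# '::' is a multi-character separator, so request the Python parsing engine explicitly
df_movies = pd.read_csv('data/movies.dat', sep='::', names=i_cols, engine='python', encoding='latin-1')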

df_movies.head()



Unnamed: 0,movie_id,title,genres
0,1,Toy Story (1995),Animation|Children's|Comedy
1,2,Jumanji (1995),Adventure|Children's|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama
4,5,Father of the Bride Part II (1995),Comedy
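
Because genres are pipe-separated and titles end with the release year in parentheses, both fields can be unpacked into more convenient columns. The following is a rough sketch on a hand-typed copy of the first rows shown above. `sample_movies`, `genre_list`, `genre_flags`, and `year` are illustrative names, and the year pattern assumes every title ends with a four-digit year in parentheses.

```python
import pandas as pd

# Hand-typed copy of the first rows shown above (illustrative only).
sample_movies = pd.DataFrame({
    "movie_id": [1, 2, 3],
    "title": ["Toy Story (1995)", "Jumanji (1995)", "Grumpier Old Men (1995)"],
    "genres": [
        "Animation|Children's|Comedy",
        "Adventure|Children's|Fantasy",
        "Comedy|Romance",
    ],
})

# Split the pipe-separated string into a Python list per movie.
sample_movies["genre_list"] = sample_movies["genres"].str.split("|")

# One 0/1 indicator column per genre.
genre_flags = sample_movies["genres"].str.get_dummies(sep="|")

# Pull the four-digit year out of the trailing "(YYYY)".
sample_movies["year"] = (
    sample_movies["title"]
    .str.extract(r"\((\d{4})\)\s*$", expand=False)
    .astype(int)
)

print(sample_movies[["title", "year", "genre_list"]])
print(genre_flags)
```

Since the movie entries were made by hand, a title that does not match the pattern would yield a missing year, and the `astype(int)` call would then fail; checking for missing values first is a sensible precaution on the full file.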


## Reading ratings file
All ratings are contained in the file "ratings.dat" and are in the
following format:

`UserID::MovieID::Rating::Timestamp`

- UserIDs range between 1 and 6040 
- MovieIDs range between 1 and 3952
- Ratings are made on a 5-star scale (whole-star ratings only)
- Timestamp is represented in seconds since the epoch as returned by time(2)
- Each user has at least 20 ratings


In [15]:
r_cols = ['user_id', 'movie_id', 'rating', 'timestamp']
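# '::' is a multi-character separator, so request the Python parsing engine explicitly
df_ratings = pd.read_csv('data/ratings.dat', sep='::', names=r_cols, engine='python', encoding='latin-1')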
df_ratings.head()





Unnamed: 0,user_id,movie_id,rating,timestamp
0,1,1193,5,978300760
1,1,661,3,978302109
2,1,914,3,978301968
3,1,3408,4,978300275
4,1,2355,5,978824291
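
The timestamps are seconds since the Unix epoch, so they can be converted to readable datetimes, and the documented 5-star scale can be checked at the same time. This is a small sketch on a hand-typed copy of the first rows above. `sample_ratings`, `rated_at`, and `ratings_in_range` are illustrative names, and the sketch assumes the lowest rating is 1 star and interprets the timestamps as UTC.

```python
import pandas as pd

# Hand-typed copy of the first rows shown above (illustrative only).
sample_ratings = pd.DataFrame({
    "user_id": [1, 1, 1],
    "movie_id": [1193, 661, 914],
    "rating": [5, 3, 3],
    "timestamp": [978300760, 978302109, 978301968],
})

# unit="s" interprets the integers as seconds since 1970-01-01 00:00:00 UTC.
sample_ratings["rated_at"] = pd.to_datetime(
    sample_ratings["timestamp"], unit="s", utc=True
)

# Check the documented 5-star, whole-star scale (assumes 1 is the minimum).
ratings_in_range = sample_ratings["rating"].between(1, 5).all()

print(sample_ratings[["movie_id", "rating", "rated_at"]])
print("All ratings within 1-5:", ratings_in_range)
```

On the full data set, the claim that each user has at least 20 ratings could be checked in the same spirit with `df_ratings.groupby('user_id').size().min()`.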
