#### Visualising the data

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Reading the dataset

In [2]:
df = pd.read_csv("Datasets/u.data")
df.head()

Unnamed: 0,196\t242\t3\t881250949
0,186\t302\t3\t891717742
1,22\t377\t1\t878887116
2,244\t51\t2\t880606923
3,166\t346\t1\t886397596
4,298\t474\t4\t884182806


This makes no sense: each row has been read into a single column because the fields are separated by tabs, not commas. Hence we pass the tab separator.

In [3]:
df = pd.read_csv("Datasets/u.data", sep="\t")
df.head()

Unnamed: 0,196,242,3,881250949
0,186,302,3,891717742
1,22,377,1,878887116
2,244,51,2,880606923
3,166,346,1,886397596
4,298,474,4,884182806


But the file has no header row, so the first record has been used as the column names. Hence we have to supply our own.

In [4]:
df = pd.read_csv("Datasets/u.data", sep="\t", names=["user_id", "item_id", "rating", "timestamp"])
df.head()

Unnamed: 0,user_id,item_id,rating,timestamp
0,196,242,3,881250949
1,186,302,3,891717742
2,22,377,1,878887116
3,244,51,2,880606923
4,166,346,1,886397596
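The effect of `names` can be reproduced on a tiny in-memory sample. This is only a sketch: the two rows in `raw` are copied from the output above, and the variable names `raw`, `default_read` and `named_read` are made up for illustration.

```python
import io

import pandas as pd

# Two sample rows in the same tab-separated, header-less layout as u.data.
raw = "196\t242\t3\t881250949\n186\t302\t3\t891717742\n"

# Default behaviour: the first record is consumed as the header row.
default_read = pd.read_csv(io.StringIO(raw), sep="\t")

# With names=, pandas uses our labels and keeps every record as data.
named_read = pd.read_csv(
    io.StringIO(raw),
    sep="\t",
    names=["user_id", "item_id", "rating", "timestamp"],
)

print(len(default_read))  # 1: the first record became column labels
print(len(named_read))    # 2: both records are kept
```

Without `names`, one rating is silently lost to the header, which is why the earlier `head()` output started at user 186 instead of user 196.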


Now the dataset is loaded correctly, with one rating per row

In [5]:
df.shape

(100000, 4)

In [6]:
# To check unique users
df["user_id"].nunique()

943

In [7]:
# For unique movies
df["item_id"].nunique()

1682

So we have $943$ users and $1682$ movies
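These counts also tell us how sparse the ratings are. The sketch below uses only the numbers printed above and assumes that each user rates a given movie at most once, which the notebook has not checked; the names `possible` and `density` are illustrative.

```python
# Figures taken from the outputs above.
n_ratings = 100_000  # rows in df
n_users = 943        # df["user_id"].nunique()
n_movies = 1682      # df["item_id"].nunique()

# Number of (user, movie) pairs that could have a rating.
possible = n_users * n_movies

# Fraction of those pairs present, assuming no duplicate (user, movie) rows.
density = n_ratings / possible

print(f"{possible} possible pairs, {density:.2%} of them rated")
```

Under that assumption, only about 6.3% of the possible user-movie pairs carry a rating.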

#### Details of these movies are given in the `u.item` file

In [11]:
# u.item is pipe-delimited, has no header row, and is read here as latin-1
movies = pd.read_csv("Datasets/u.item", sep="|", encoding='latin-1', header=None)
movies.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,14,15,16,17,18,19,20,21,22,23
0,1,Toy Story (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Toy%20Story%2...,0,0,0,1,1,...,0,0,0,0,0,0,0,0,0,0
1,2,GoldenEye (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?GoldenEye%20(...,0,1,1,0,0,...,0,0,0,0,0,0,0,1,0,0
2,3,Four Rooms (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Four%20Rooms%...,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
3,4,Get Shorty (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Get%20Shorty%...,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,5,Copycat (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Copycat%20(1995),0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0


In [12]:
movies.shape

(1682, 24)

We see that all 1682 movies are described here, so we can extract the titles from this table

In [13]:
# Extracting the movie id (column 0) and the title (column 1)
movie_titles = movies[[0,1]]
movie_titles.head()

Unnamed: 0,0,1
0,1,Toy Story (1995)
1,2,GoldenEye (1995)
2,3,Four Rooms (1995)
3,4,Get Shorty (1995)
4,5,Copycat (1995)


In [14]:
movie_titles.columns = ["item_id", "title"]
movie_titles.head()

Unnamed: 0,item_id,title
0,1,Toy Story (1995)
1,2,GoldenEye (1995)
2,3,Four Rooms (1995)
3,4,Get Shorty (1995)
4,5,Copycat (1995)


#### `merge()`
- Merges two DataFrames on one or more key columns
- Works like `JOIN` in SQL (an inner join by default)
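To make the SQL analogy concrete, here is a small sketch on made-up data. The frames `ratings` and `titles` and their values are hypothetical, not taken from the dataset. It contrasts the default inner join with a left join.

```python
import pandas as pd

# Hypothetical toy data: item 30 has a rating but no title entry.
ratings = pd.DataFrame(
    {"user_id": [1, 2, 3], "item_id": [10, 20, 30], "rating": [4, 5, 3]}
)
titles = pd.DataFrame({"item_id": [10, 20], "title": ["Movie A", "Movie B"]})

# Default how="inner": keeps only keys present in both frames.
inner = pd.merge(ratings, titles, on="item_id")

# how="left": keeps every row of the left frame, filling gaps with NaN.
left = pd.merge(ratings, titles, on="item_id", how="left")

print(inner)
print(left)
```

The inner join drops the rating for item 30, while the left join keeps it with a missing title. When every key has a match, as appears to be the case for the movie data here, both give the same rows.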

In [15]:
new_df = pd.merge(df, movie_titles, on="item_id")
new_df.head()

Unnamed: 0,user_id,item_id,rating,timestamp,title
0,196,242,3,881250949,Kolya (1996)
1,63,242,3,875747190,Kolya (1996)
2,226,242,5,883888671,Kolya (1996)
3,154,242,3,879138235,Kolya (1996)
4,306,242,5,876503793,Kolya (1996)


The titles have been added as required

In [16]:
new_df.tail()

Unnamed: 0,user_id,item_id,rating,timestamp,title
99995,840,1674,4,891211682,Mamma Roma (1962)
99996,655,1640,3,888474646,"Eighth Day, The (1996)"
99997,655,1637,3,888984255,Girls Town (1996)
99998,655,1630,3,887428735,"Silence of the Palace, The (Saimt el Qusur) (1..."
99999,655,1641,3,887427810,Dadetown (1995)
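The merged frame still ends at index 99999, which suggests no ratings were lost. One way to check that explicitly is sketched below on hypothetical toy frames. The `validate` and `indicator` arguments are real `pd.merge` parameters, but applying them here is a suggestion rather than part of the original notebook.

```python
import pandas as pd

# Hypothetical toy data standing in for df and movie_titles.
ratings = pd.DataFrame(
    {"user_id": [1, 2, 3], "item_id": [10, 10, 20], "rating": [3, 5, 4]}
)
titles = pd.DataFrame({"item_id": [10, 20], "title": ["Movie A", "Movie B"]})

merged = pd.merge(
    ratings,
    titles,
    on="item_id",
    how="left",              # keep every rating, even without a title
    validate="many_to_one",  # raise if an item_id repeats in titles
    indicator=True,          # add a _merge column describing each row's origin
)

# Ratings whose item_id found no title show up as "left_only".
unmatched = int((merged["_merge"] != "both").sum())

print(len(merged) == len(ratings), unmatched)
```

If the row count is unchanged and `unmatched` is zero, every rating found exactly one title.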
