# Loading the Data

Each data file named combined_data contains:

1) MovieIDs

2) CustomerIDs

3) Ratings

4) Dates of Rating
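Judging from the parsing code later in this section, each combined_data file appears to be laid out as a movie ID header line ending in ':' followed by one line per rating. The sketch below illustrates that layout on a two-rating excerpt. The `sample` string is a hypothetical excerpt (its values are taken from the df.head() output in this section), and `parsed` is a name introduced only for this example.

```python
# Hypothetical excerpt in the layout that the parsing code expects:
# a "MovieID:" header line, then "CustomerID,Rating,Date" lines.
sample = "1:\n1488844,3,2005-09-06\n822109,5,2005-05-13\n"

parsed = []
movie_id = None
for line in sample.splitlines():
    if line.endswith(":"):
        # Header line: everything before the ':' is the movie ID
        movie_id = int(line[:-1])
    else:
        # Rating line: three comma-separated fields
        cust_id, rating, date = line.split(",")
        parsed.append((movie_id, cust_id, float(rating), date))

for row in parsed:
    print(row)
```

Each rating inherits the movie ID from the most recent header line, which is why the header has to be tracked while reading.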

The data file named movie_titles contains:

1) MovieID

2) YearOfRelease

3) Title

### Importing Libraries

Importing general libraries for working with the data: numpy, pandas, matplotlib, seaborn. 

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

The combined_data files are not plain CSV (a movie ID header line precedes each block of ratings), so they cannot be loaded directly. Instead, we write a function that converts each text file into a separate dataframe, and later combine these dataframes into a single one to work with.

My system cannot load the full dataset, so I will load only the first 10,000 lines of each file.

In [8]:
# Function to read each file

def read_text(path, rows=10000):
    # Dictionary of lists, one list per column
    data = {'Cust_Id': [], 'Movie_Id': [], 'Rating': [], 'Date': []}
    count = 0
    with open(path, "r") as f:
        for line in f:
            count += 1   # counts every line read, including movie ID header lines
            if count > rows:
                break

            if ':' in line:
                # Header line such as "1:\n": drop the trailing ':' and the newline
                movieId = int(line.strip().rstrip(':'))
            else:
                customerID, rating, date = line.split(',')
                data['Cust_Id'].append(customerID)
                data['Movie_Id'].append(movieId)
                data['Rating'].append(rating)
                data['Date'].append(date.rstrip("\n"))  # rstrip("\n") removes the trailing newline

    return pd.DataFrame(data)

In [7]:
# Converting all files into dataframes

df1 = read_text('C:/Users/umar/Documents/Machine Learning/Recommender System/Data/combined_data_1.txt', rows=10000)
df2 = read_text('C:/Users/umar/Documents/Machine Learning/Recommender System/Data/combined_data_2.txt', rows=10000)
df3 = read_text('C:/Users/umar/Documents/Machine Learning/Recommender System/Data/combined_data_3.txt', rows=10000)
df4 = read_text('C:/Users/umar/Documents/Machine Learning/Recommender System/Data/combined_data_4.txt', rows=10000)

# converting ratings into float
# (example: '3'->3.0)

df1['Rating'] = df1['Rating'].astype(float)
df2['Rating'] = df2['Rating'].astype(float)
df3['Rating'] = df3['Rating'].astype(float)
df4['Rating'] = df4['Rating'].astype(float)

In [9]:
# Merging all the dataframes
# (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)

df = pd.concat([df1, df2, df3, df4], ignore_index=True)

In [10]:
df.head()

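|   | Cust_Id | Movie_Id | Rating | Date |
|---|---------|----------|--------|------|
| 0 | 1488844 | 1 | 3.0 | 2005-09-06 |
| 1 | 822109 | 1 | 5.0 | 2005-05-13 |
| 2 | 885013 | 1 | 4.0 | 2005-10-19 |
| 3 | 30878 | 1 | 4.0 | 2005-12-26 |
| 4 | 823519 | 1 | 3.0 | 2004-05-03 |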


In [13]:
df.shape

(39967, 4)

In the read_text function, count is incremented on every line, including the lines that only carry a movie ID, so those header lines use up part of the 10,000-line limit. That is why we get 39967 rows and not 40000 rows: the 33 missing rows are movie ID header lines. To fix this we could move "count += 1" inside the else branch, but I will continue with the current dataframe.
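A minimal sketch of that fix is shown below, assuming the file layout described above. The function name `parse_ratings` and the in-memory `sample` text are hypothetical and introduced only for illustration; the second movie block in `sample` is made up. The sketch counts only rating rows, so header lines no longer consume part of the limit.

```python
import io

def parse_ratings(lines, rows=10000):
    """Collect at most `rows` rating rows; movie ID header lines are not counted."""
    data = {'Cust_Id': [], 'Movie_Id': [], 'Rating': [], 'Date': []}
    movie_id = None
    count = 0
    for line in lines:
        if count >= rows:
            break
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.endswith(':'):
            # Header line: remember the movie ID, but do not count it
            movie_id = int(line[:-1])
            continue
        cust_id, rating, date = line.split(',')
        data['Cust_Id'].append(cust_id)
        data['Movie_Id'].append(movie_id)
        data['Rating'].append(float(rating))
        data['Date'].append(date)
        count += 1  # only rating rows count towards the limit
    return data

# Hypothetical sample: the second movie block is invented for illustration
sample = "1:\n1488844,3,2005-09-06\n822109,5,2005-05-13\n2:\n111,4,2005-01-01\n"
data = parse_ratings(io.StringIO(sample), rows=3)
print(len(data['Rating']), data['Movie_Id'])
```

With the original counting logic, a limit of 3 on this sample would yield only two rating rows, because the first header line would use up one of the three counted lines. To use the sketch on a real file, pass an open file object as `lines` and wrap the result in pd.DataFrame(data).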

In [18]:
# saving the dataframe

df.to_csv('data.csv')

In [17]:
# Loading movie_titles

df_title = pd.read_csv('C:/Users/umar/Documents/Machine Learning/Recommender System/Data/movie_titles.csv', header = None, names = ['Movie_Id', 'Year', 'Name'])
df_title.head(10)

|   | Movie_Id | Year | Name |
|---|----------|------|------|
| 0 | 1 | 2003.0 | Dinosaur Planet |
| 1 | 2 | 2004.0 | Isle of Man TT 2004 Review |
| 2 | 3 | 1997.0 | Character |
| 3 | 4 | 1994.0 | Paula Abdul's Get Up & Dance |
| 4 | 5 | 2004.0 | The Rise and Fall of ECW |
| 5 | 6 | 1997.0 | Sick |
| 6 | 7 | 1992.0 | 8 Man |
| 7 | 8 | 2004.0 | What the #$*! Do We Know!? |
| 8 | 9 | 1991.0 | Class of Nuke 'Em High 2 |
| 9 | 10 | 2001.0 | Fighter |


### Conclusion 


The full dataset is too large to load on my laptop. Pandas achieves its speed by holding the dataset in RAM while performing calculations, so the available memory limits how much data can be loaded. I have therefore used only 10,000 lines per file, which is a small sample.

In this section I loaded the data, converted each file into a dataframe, and combined the dataframes into a single dataframe, which I am going to use further.