## Netflix Movie Recommendation

### Business Problem

We want to recommend movies to users. To do so, we will:
- Predict the rating a user would give to a movie they have not yet rated
- Build a model evaluated with RMSE (between actual and predicted ratings) as the metric

The predicted ratings need not be computed instantaneously, since they would be precomputed on a daily basis (for example, nightly).
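As a minimal sketch of the RMSE metric named above, the helper below takes the square root of the mean squared difference between actual and predicted ratings. The function name `rmse` and the sample ratings are illustrative assumptions, not part of the original notebook.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length rating sequences."""
    if len(actual) != len(predicted):
        raise ValueError('actual and predicted must have the same length')
    if not actual:
        raise ValueError('at least one rating is required')
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up ratings, for illustration only
actual_ratings = [3, 5, 4, 4]
predicted_ratings = [3.5, 4.5, 4.0, 3.0]
print(rmse(actual_ratings, predicted_ratings))
```

A lower RMSE means the predicted ratings sit closer to the actual ones, and a value of 0 means every prediction is exact.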

### Data

The dataset can be downloaded from https://www.kaggle.com/netflix-inc/netflix-prize-data

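In the combined_data files, the ratings for each movie begin with a line containing the MovieID followed by a colon. Each subsequent line corresponds to a rating from a customer and its date, in the following format:

Columns: __CustomerID, Rating, Date__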

- MovieIDs range from 1 to 17770 sequentially.
- CustomerIDs range from 1 to 2649429, with gaps. There are 480189 users.
- Ratings are on a five star (integral) scale from 1 to 5.
- Dates have the format YYYY-MM-DD.
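The layout described above could be parsed along the following lines. This is a sketch under the assumption that each movie's block starts with a `MovieID:` line, as it does in the raw rating files. The function name `parse_ratings` is hypothetical, and `date.fromisoformat` requires Python 3.7 or later.

```python
from datetime import date

def parse_ratings(lines):
    """Yield (movie_id, customer_id, rating, date) tuples from raw lines."""
    movie_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.endswith(':'):
            # A header such as '1:' starts the block for a new movie
            movie_id = int(line[:-1])
            continue
        if movie_id is None:
            raise ValueError('rating line found before any "MovieID:" header')
        customer_id, rating, rated_on = line.split(',')
        yield movie_id, int(customer_id), int(rating), date.fromisoformat(rated_on)

# A few lines in the layout of combined_data_1.txt
sample = ['1:\n', '1488844,3,2005-09-06\n', '822109,5,2005-05-13\n']
for record in parse_ratings(sample):
    print(record)
```

Writing this as a generator means only one line is held in memory at a time, which matters for files of several hundred megabytes.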

Movie information in "movie_titles.csv" is in the following format:

MovieID,YearOfRelease,Title

- MovieIDs do not correspond to actual Netflix movie IDs or IMDB movie IDs.
- YearOfRelease can range from 1890 to 2005 and may correspond to the release of the corresponding DVD, not necessarily its theatrical release.
- Title is the Netflix movie title and may not correspond to titles used on other sites. Titles are in English.
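Because the title is the last field, a title could itself contain commas, so a plain split on every comma may break it apart. The sketch below splits on the first two commas only. The function name `parse_movie_title`, the sample line, and the choice to treat a non-numeric year as missing are all assumptions made for illustration.

```python
def parse_movie_title(line):
    """Split one 'MovieID,YearOfRelease,Title' line into typed fields."""
    # maxsplit=2 keeps any commas inside the title intact
    movie_id, year, title = line.rstrip('\n').split(',', 2)
    # Assumption: a non-numeric year is treated as missing, not as an error
    year_of_release = int(year) if year.isdigit() else None
    return int(movie_id), year_of_release, title

# Hypothetical line, invented for illustration
print(parse_movie_title('101,1999,Example Title, The\n'))
```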

### ML Problem

If we pose this as an ML problem, we can categorize it as:
- Recommendation Problem to Recommend Movies
- Regression task to predict the ratings for an unrated movie

### Importing Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
from itertools import islice
import re
from tqdm import tqdm
from datetime import datetime

import warnings
warnings.filterwarnings('ignore')

### Exploratory Data Analysis

#### Data Preprocessing

In [2]:
# List all files in the data directory with their sizes in MB
for file in os.listdir('./data'):
    size_mb = round(os.path.getsize(os.path.join('./data', file)) / 1000000, 2)
    print('Filename: {0} --> {1}MB'.format(file, size_mb))

Filename:  - Copy.gitignore --> 0.0MB
Filename: .gitignore --> 0.0MB
Filename: .ipynb_checkpoints --> 0.0MB
Filename: combined_data_1.txt --> 495.03MB
Filename: combined_data_2.txt --> 555.21MB
Filename: combined_data_3.txt --> 465.16MB
Filename: combined_data_4.txt --> 552.54MB
Filename: movie_titles.csv --> 0.58MB
Filename: probe.txt --> 10.78MB
Filename: qualifying.txt --> 52.45MB


In [3]:
# Sample read: look at the first 10 lines of one rating file
with open('./data/combined_data_1.txt') as file:
    head = list(islice(file, 10))
# The with block closes the file, so no explicit close() is needed
print('First 10 rows: ......')
print(head)

First 10 rows: ......
['1:\n', '1488844,3,2005-09-06\n', '822109,5,2005-05-13\n', '885013,4,2005-10-19\n', '30878,4,2005-12-26\n', '823519,3,2004-05-03\n', '893988,3,2005-11-17\n', '124105,4,2004-08-05\n', '1248029,3,2004-04-22\n', '1842128,4,2004-05-09\n']


In [4]:
# Get all rating data files (sorted, since os.listdir order is arbitrary)
files = sorted(f for f in os.listdir('./data/')
               if re.match(r'combined_data.*\.txt', f))
files

['combined_data_1.txt',
 'combined_data_2.txt',
 'combined_data_3.txt',
 'combined_data_4.txt']

In [5]:
# Combine all rating files into a single CSV whose rows have the form
# movie ID, user ID, rating, date
start_time = datetime.now()
if not os.path.isfile('./data/all_data_combined.csv'):
    with open('./data/all_data_combined.csv', mode='w') as all_data_combined:
        for file in tqdm(files):
            with open(os.path.join('./data/', file)) as f:
                for line in f:
                    line = line.strip()
                    if line.endswith(':'):
                        # A line such as '1:' starts a new movie block; all
                        # lines that follow are ratings for that movie until
                        # the next 'MovieID:' line
                        movie_id = line.replace(':', '')
                    else:
                        # Prepend the current movie ID to the rating row
                        all_data_combined.write(movie_id + ',' + line + '\n')
            # Printed once per input file
            print('All data has been combined')
print('Total Time Taken: {}'.format(datetime.now() - start_time))
                

  0%|          | 0/4 [00:00<?, ?it/s]

All data has been combined


 25%|██▌       | 1/4 [01:49<05:27, 109.22s/it]

All data has been combined


 50%|█████     | 2/4 [03:44<03:42, 111.13s/it]

All data has been combined


 75%|███████▌  | 3/4 [05:17<01:45, 105.45s/it]

All data has been combined


100%|██████████| 4/4 [08:43<00:00, 135.81s/it]


Total Time Taken: 0:08:43.693608
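One way to sanity-check the combined file is to stream through it and confirm that every row has four fields, counting ratings per movie along the way. The sketch below is an assumption about how such a check might look, not a step from the original notebook. The function name `summarize_combined` is hypothetical, and the last sample row is invented so that a second movie appears.

```python
import csv
import io
from collections import Counter

def summarize_combined(handle):
    """Return (row count, ratings per movie) for a combined ratings CSV."""
    n_rows = 0
    per_movie = Counter()
    for row in csv.reader(handle):
        if len(row) != 4:
            raise ValueError('expected 4 fields, got {}: {}'.format(len(row), row))
        per_movie[row[0]] += 1  # the first field is the movie ID
        n_rows += 1
    return n_rows, per_movie

# First three rows as produced for movie 1, plus one invented row for movie 2
sample = io.StringIO(
    '1,1488844,3,2005-09-06\n'
    '1,822109,5,2005-05-13\n'
    '1,885013,4,2005-10-19\n'
    '2,1,4,2005-01-01\n'
)
n_rows, per_movie = summarize_combined(sample)
print(n_rows, dict(per_movie))
```

For the real file, the same function could be called on a handle opened with `open('./data/all_data_combined.csv', newline='')`.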
