# Data Cleaning
In this notebook, we clean the data: we drop the fields we don't need for the analysis. This doesn't necessarily make the analysis itself any easier, but it helps us focus on what's important.

## Movies Data
We start with the `movies.dat` file, where each line has the form `id::title (year)::genres`. We drop the movie's name, since it isn't relevant to our analysis, but keep its release year.

In [1]:
import os
import re

In [2]:
movie_file = os.path.join(os.path.pardir, 'movies.dat')

# The file seems to mix encodings; decoding it as cp1252 (a Western European
# encoding) works.
with open(movie_file, 'r', encoding='cp1252') as f:
    lines = f.readlines()

for i, line in enumerate(lines):
    # Each line looks like: id::title (year)::genres
    parts = line.split('::')

    # The title ends with the release year in parentheses, e.g. "(1995)".
    # Capture only the digits, so the parentheses are left out.
    match = re.search(r'\((\d{4})\)\s*$', parts[1])
    # parts[2] still carries the newline, so each output line ends with one.
    lines[i] = parts[0] + ',' + match.group(1) + ',' + parts[2]
print(lines[:10])

["1,1995,Animation|Children's|Comedy\n", "2,1995,Adventure|Children's|Fantasy\n", '3,1995,Comedy|Romance\n', '4,1995,Comedy|Drama\n', '5,1995,Comedy\n', '6,1995,Action|Crime|Thriller\n', '7,1995,Comedy|Romance\n', "8,1995,Adventure|Children's\n", '9,1995,Action\n', '10,1995,Action|Adventure|Thriller\n']
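
As a quick check of the year extraction, here is a minimal sketch that applies the same transformation to individual lines. The helper name `clean_movie_line` and the two sample lines are illustrative assumptions, not part of the notebook's pipeline; the second sample shows a title that itself contains digits.

```python
import re

# Hypothetical helper for illustration; it does the same job as the loop body.
def clean_movie_line(line):
    """Turn 'id::title (year)::genres' into 'id,year,genres'."""
    movie_id, title, genres = line.split('::')
    # Anchor on the parenthesised four-digit year at the end of the title.
    match = re.search(r'\((\d{4})\)\s*$', title)
    if match is None:
        raise ValueError('no release year found in: ' + title)
    # genres keeps its trailing newline, so the result ends with one too.
    return movie_id + ',' + match.group(1) + ',' + genres

# Sample lines assumed to follow the movies.dat format.
samples = [
    "1::Toy Story (1995)::Animation|Children's|Comedy\n",
    "924::2001: A Space Odyssey (1968)::Drama|Mystery|Sci-Fi|Thriller\n",
]
cleaned = [clean_movie_line(s) for s in samples]
print(cleaned)
```

Raising on a missing year, rather than failing later on `None`, makes it easier to spot a line that doesn't follow the expected format.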


In [3]:
movie_processed_file = os.path.join(os.path.pardir, 'processed', 'movies_processed.dat')
with open(movie_processed_file, 'w') as f:
    f.write('id,year,genres\n')
    f.writelines(lines)
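
Once written, the file is plain comma-separated text, so it can be read back with the standard `csv` module. The sketch below uses an in-memory `io.StringIO` stand-in holding the header and the first two rows from the output above, rather than the real file. Splitting the genres on `|` is an assumed next step, not something this notebook does.

```python
import csv
import io

# In-memory stand-in for movies_processed.dat (header plus two rows).
processed = io.StringIO(
    "id,year,genres\n"
    "1,1995,Animation|Children's|Comedy\n"
    "2,1995,Adventure|Children's|Fantasy\n"
)

rows = []
for row in csv.DictReader(processed):
    rows.append({
        'id': int(row['id']),
        'year': int(row['year']),
        # Genres are pipe-separated inside the single CSV field.
        'genres': row['genres'].split('|'),
    })
print(rows[0])
```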

## Ratings File
Next, we process the `training_ratings_for_kaggle_comp.csv` file. We drop the last column of every row and keep only the user, movie, and rating.

In [4]:
ratings_file = os.path.join(os.path.pardir, 'training_ratings_for_kaggle_comp.csv')

with open(ratings_file, 'r') as f:
    lines = f.readlines()

# Drop the last comma-separated field of every row (header included).
lines = [','.join(line.split(',')[:-1]) + '\n' for line in lines]
print(lines[:10])

['user,movie,rating\n', '2783,1253,5\n', '2783,589,5\n', '2783,1270,4\n', '2783,1274,4\n', '2783,741,5\n', '2783,750,5\n', '2783,924,5\n', '2783,2407,4\n', '2783,3070,3\n']
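
The list comprehension above drops the final comma-separated field of every row. A minimal sketch of the same idea on in-memory data follows. The fourth column's name (`id`) and its values are made up for illustration, since the raw file's last column isn't shown here, and `drop_last_column` is a hypothetical helper.

```python
# Hypothetical helper for illustration.
def drop_last_column(line):
    """Remove the last comma-separated field, keeping a trailing newline."""
    # rsplit with maxsplit=1 cuts only at the final comma.
    return line.rstrip('\n').rsplit(',', 1)[0] + '\n'

# Made-up rows: the fourth column here is an assumption.
raw = [
    'user,movie,rating,id\n',
    '2783,1253,5,2783_1253\n',
    '2783,589,5,2783_589\n',
]
trimmed = [drop_last_column(line) for line in raw]
print(trimmed)
```

Compared with `line.split(',')[:-1]`, this avoids splitting and re-joining the fields that are kept.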


In [5]:
ratings_processed_file = os.path.join(os.path.pardir, 'processed', 'ratings_processed.csv')
with open(ratings_processed_file, 'w') as f:
    f.writelines(lines)

## Users File
We leave the data in the users file as it is, since any of its fields may be interesting to experiment with. The only change is a header row at the top.

In [6]:
users_file = os.path.join(os.path.pardir, 'users.dat')
users_processed_file = os.path.join(os.path.pardir, 'processed', 'users_processed.dat')
with open(users_file, 'r') as f:
    lines = f.readlines()

# Copy the rows unchanged, adding only a header line.
with open(users_processed_file, 'w') as g:
    g.write('id::gender::age_group::occupation::zip_code\n')
    g.writelines(lines)
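
With the header in place, each row can be paired with the header's field names. The sketch below is one way to do that, using an in-memory stand-in for the processed file. The sample row is illustrative and `parse_users` is a hypothetical helper, not part of this notebook.

```python
import io

# Hypothetical helper for illustration.
def parse_users(f):
    """Yield one dict per row of a '::'-delimited file with a header line."""
    header = next(f).rstrip('\n').split('::')
    for line in f:
        yield dict(zip(header, line.rstrip('\n').split('::')))

# In-memory stand-in for users_processed.dat; the data row is illustrative.
users_text = io.StringIO(
    'id::gender::age_group::occupation::zip_code\n'
    '1::F::1::10::48067\n'
)
users = list(parse_users(users_text))
print(users[0])
```

The standard `csv` module only accepts single-character delimiters, which is why the rows are split by hand here.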