
### **Data Preprocessing for the MovieLens Dataset**

The MovieLens dataset is widely used for building recommendation systems and analyzing movie ratings. This notebook preprocesses the dataset before further analysis or modeling.


**Preprocessing Steps:**

1. **Loading the dataset:** The MovieLens CSV files are loaded from the `Data/` directory.

2. **Handling missing values:** Missing values are counted per column and then handled by imputation or removal.

3. **Removing duplicates:** Duplicate entries are identified and removed to preserve data integrity.

4. **Encoding categorical variables:** Categorical variables are converted to numerical representations so that they can be used in analysis or modeling.

5. **Scaling numerical variables:** Numerical variables are scaled to a common range so that they carry similar weight during analysis or modeling.



After these preprocessing steps, the MovieLens dataset is ready for further analysis or modeling.
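The steps listed above can be sketched on a tiny, made-up frame. The column names mirror the MovieLens files, but the values, the use of integer category codes, and the choice of min-max scaling are assumptions made for illustration, not necessarily what this notebook applies.

```python
import pandas as pd

# Toy data (hypothetical values) with one missing tag and one duplicate row.
toy = pd.DataFrame({
    "movieId": [1, 1, 2, 3],
    "tag": ["pixar", "pixar", None, "sequel"],
    "rating": [4.0, 4.0, 3.0, 5.0],
})

# Handling missing values: drop rows whose tag is missing.
clean = toy.dropna(subset=["tag"])

# Removing duplicates: drop exact duplicate rows.
clean = clean.drop_duplicates().reset_index(drop=True)

# Encoding categorical variables: map each tag to an integer code.
clean["tag_code"] = clean["tag"].astype("category").cat.codes

# Scaling numerical variables: min-max scale ratings to [0, 1].
r_min, r_max = clean["rating"].min(), clean["rating"].max()
clean["rating_scaled"] = (clean["rating"] - r_min) / (r_max - r_min)

print(clean)
```

Min-max scaling is only one option; standardization to zero mean and unit variance would follow the same pattern.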

**Author:** [Your Name]

**Date:** [Current Date]

In [1]:
#imports
import pandas as pd
import dask.dataframe as dd

Dask dataframe query planning is disabled because dask-expr is not installed.

You can install it with `pip install dask[dataframe]` or `conda install dask`.
This will raise in a future version.



In [2]:
#Loading Dataset
ratings = pd.read_csv('Data/ratings.csv')
tags = pd.read_csv('Data/tags.csv')
movies = pd.read_csv('Data/movies.csv')
links = pd.read_csv('Data/links.csv')
genome_scores = pd.read_csv('Data/genome-scores.csv')
genome_tags = pd.read_csv('Data/genome-tags.csv')


ParserError: Error tokenizing data. C error: Calling read(nbytes) on source failed. Try engine='python'.
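The `ParserError` above is raised while pandas reads one of the files; the output does not show which file or why. One common way to lower peak memory when reading a large CSV is to read it in chunks with explicit dtypes. The sketch below shows that pattern on an in-memory stand-in. The sample rows, the chunk size, and the dtypes are assumptions, and this may or may not resolve the error seen here.

```python
import io
import pandas as pd

# In-memory stand-in for a large ratings file (hypothetical rows).
csv_text = (
    "userId,movieId,rating,timestamp\n"
    "1,1,4.0,1225734739\n"
    "1,110,4.0,1225865086\n"
    "2,1,3.0,1225733503\n"
)

# Narrower dtypes reduce the memory needed per row.
dtypes = {"userId": "int32", "movieId": "int32",
          "rating": "float32", "timestamp": "int64"}

# Read two rows at a time, then stitch the pieces back together.
reader = pd.read_csv(io.StringIO(csv_text), dtype=dtypes, chunksize=2)
ratings_sample = pd.concat([chunk for chunk in reader], ignore_index=True)

print(ratings_sample.dtypes)
print(len(ratings_sample))
```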

## Data Cleaning

In [None]:
print(ratings.isnull().sum())
print(tags.isnull().sum())
print(movies.isnull().sum())
print(links.isnull().sum())
print(genome_scores.isnull().sum())
print(genome_tags.isnull().sum())

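# Example of handling missing values (if any).
# Note: dropna() returns a new DataFrame. The results are not assigned here,
# so the original DataFrames are left unchanged.
ratings.dropna()
tags.dropna()
movies.dropna()
links.dropna()
genome_scores.dropna()
genome_tags.dropna()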

userId       0
movieId      0
rating       0
timestamp    0
dtype: int64
userId        0
movieId       0
tag          17
timestamp     0
dtype: int64
movieId    0
title      0
genres     0
dtype: int64
movieId      0
imdbId       0
tmdbId     126
dtype: int64
movieId      0
tagId        0
relevance    0
dtype: int64
tagId    0
tag      0
dtype: int64


      tagId           tag
0         1           007
1         2  007 (series)
2         3  18th century
3         4         1920s
4         5         1930s
...     ...           ...
1123   1124       writing
1124   1125         wuxia
1125   1126          wwii
1126   1127        zombie
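The counts above show missing values in only two columns: `tag` in the tags file and `tmdbId` in the links file. A minimal sketch of column-specific handling follows, using toy frames with hypothetical values. Dropping rows with a missing tag while keeping every links row is an assumption about what is wanted, not something the notebook states.

```python
import pandas as pd

# Toy stand-ins (hypothetical values) for tags.csv and links.csv.
tags_toy = pd.DataFrame({
    "userId": [10, 10, 14],
    "movieId": [260, 260, 1221],
    "tag": ["sci-fi", None, "mafia"],
})
links_toy = pd.DataFrame({
    "movieId": [1, 2, 3],
    "imdbId": [101, 102, 103],
    "tmdbId": [862.0, None, 15602.0],
})

# dropna returns a new frame, so the result has to be assigned.
tags_clean = tags_toy.dropna(subset=["tag"])

# Keep all link rows; a nullable integer dtype stores the gap as <NA>.
links_clean = links_toy.astype({"tmdbId": "Int64"})

print(tags_clean)
print(links_clean)
```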


In [None]:
#Checking the data
print("Ratings Dataset\n")
print("Columns: ",ratings.columns)
print(ratings.head(5))

print("\nTags Dataset\n")
print("Columns: ", tags.columns)
print(tags.head(5))

print("\nMovies Dataset\n")
print("Columns: ", movies.columns)
print(movies.head(5))

print("\nLinks Dataset\n")
print("Columns: ", links.columns)
print(links.head(5))

print("\nGenome Scores Dataset\n")
print("Columns: ", genome_scores.columns)
print(genome_scores.head(5))

print("\nGenome Tags Dataset\n")
print("Columns: ", genome_tags.columns)
print(genome_tags.head(5))




Ratings Dataset

Columns:  Index(['userId', 'movieId', 'rating', 'timestamp'], dtype='object')
   userId  movieId  rating   timestamp
0       1        1     4.0  1225734739
1       1      110     4.0  1225865086
2       1      158     4.0  1225733503
3       1      260     4.5  1225735204
4       1      356     5.0  1225735119

Tags Dataset

Columns:  Index(['userId', 'movieId', 'tag', 'timestamp'], dtype='object')
   userId  movieId            tag   timestamp
0      10      260   good vs evil  1430666558
1      10      260  Harrison Ford  1430666505
2      10      260         sci-fi  1430666538
3      14     1221      Al Pacino  1311600756
4      14     1221          mafia  1311600746

Movies Dataset

Columns:  Index(['movieId', 'title', 'genres'], dtype='object')
   movieId                               title  \
0        1                    Toy Story (1995)   
1        2                      Jumanji (1995)   
2        3             Grumpier Old Men (1995)   
3        4            Wa

## Compiling the Ratings and Tags Datasets

In [None]:
import dask.dataframe as dd

# Load the ratings dataset
ratings = dd.read_csv('Data/ratings.csv')

# Group by movieId and calculate the average rating for each movie
average_ratings = ratings.groupby('movieId')['rating'].mean().reset_index()

# Rename the columns for clarity
average_ratings = average_ratings.rename(columns={'rating': 'average_rating'})

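# Save the averages to new_ratings.csv.
# Note: the index is written as well, which appears as an 'Unnamed: 0' column when the file is read back.
average_ratings.to_csv('Data/new_ratings.csv', single_file=True)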


['/workspaces/look4movie/Data/new_ratings.csv']

In [None]:
print("Average Ratings Dataset\n")
print("Columns: ",average_ratings.columns)
print(average_ratings.head(5))

Average Ratings Dataset

Columns:  Index(['movieId', 'average_rating'], dtype='object')
   movieId  average_rating
0        1        3.893508
1        2        3.278179
2        3        3.171271
3        4        2.868395
4        5        3.076957
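An average alone hides how many ratings it is based on. The sketch below computes both the mean and the number of ratings per movie with plain pandas on toy data. The `rating_count` column is a hypothetical addition and is not part of the notebook's `new_ratings.csv`.

```python
import pandas as pd

# Toy ratings (hypothetical values).
ratings_toy = pd.DataFrame({
    "userId": [1, 2, 3, 1],
    "movieId": [1, 1, 1, 2],
    "rating": [4.0, 5.0, 3.0, 2.5],
})

# Named aggregation yields one clearly labeled column per statistic.
summary = (
    ratings_toy.groupby("movieId")["rating"]
    .agg(average_rating="mean", rating_count="count")
    .reset_index()
)

print(summary)
```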


In [None]:
import dask.dataframe as dd
import pandas as pd

# Load the tags dataset
tags = dd.read_csv('Data/tags.csv')

# Define the function to concatenate tags
def concatenate_tags(df):
    return pd.DataFrame({
        'tags': [', '.join(df['tag'].dropna().unique())]
    })

# Group by movieId and concatenate the tags into a single string
tag_counts = tags.groupby('movieId').apply(concatenate_tags, meta={'tags': 'object'})

# Reset the index to make movieId a regular column
tag_counts = tag_counts.reset_index()

# Compute the result as a pandas DataFrame
tag_counts_computed = tag_counts.compute()

# Drop the 'level_1' column left over from the groupby-apply index
tag_counts_computed = tag_counts_computed.drop(columns=['level_1'])

# Display the resulting DataFrame
print("Tag Counts Dataset without level_1\n")
print(tag_counts_computed.head())

#saving the new_tags.csv file
tag_counts_computed.to_csv('Data/new_tags.csv', index=False)

Tag Counts Dataset without level_1

   movieId                                               tags
0        1  animation, friendship, toys, Disney, Pixar, CG...
1        2  animals, based on a book, fantasy, magic board...
2        3  sequel, moldy, old, old age, old men, wedding,...
3        4  characters, chick flick, girl movie, revenge, ...
4        5  family, pregnancy, wedding, 4th wall, aging, b...
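For comparison, the same per-movie tag concatenation can be written in plain pandas, which avoids the extra `level_1` column. This is a sketch on toy data and assumes the tags table fits in memory; `join_unique` is a hypothetical helper name.

```python
import pandas as pd

# Toy tags (hypothetical values) with a repeated tag and a missing tag.
tags_toy = pd.DataFrame({
    "movieId": [1, 1, 1, 2, 2],
    "tag": ["pixar", "toys", "pixar", None, "fantasy"],
})

def join_unique(series):
    # Drop missing tags, keep first-seen order, join with ", ".
    return ", ".join(series.dropna().unique())

tags_per_movie = (
    tags_toy.groupby("movieId")["tag"]
    .agg(join_unique)
    .reset_index(name="tags")
)

print(tags_per_movie)
```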


In [None]:
import dask.dataframe as dd

# Load the datasets
genome_scores = dd.read_csv('Data/genome-scores.csv')
genome_tags = dd.read_csv('Data/genome-tags.csv')

# Merge genome scores with genome tags
genome = dd.merge(genome_scores, genome_tags, on='tagId')

# Compute the merged dataframe
genome_computed = genome.compute()

# Display the first few rows
print(genome_computed.head(20))

#saving the new_genome.csv file
genome_computed.to_csv('Data/new_genome.csv', index=False)


    movieId  tagId  relevance              tag
0         1      1    0.03200              007
1         1      2    0.02225     007 (series)
2         1      3    0.07000     18th century
3         1      4    0.05900            1920s
4         1      5    0.12300            1930s
5         1      6    0.13100            1950s
6         1      7    0.06175            1960s
7         1      8    0.19550            1970s
8         1      9    0.26625            1980s
9         1     10    0.03300     19th century
10        1     11    0.56800               3d
11        1     12    0.13700             70mm
12        1     13    0.24900              80s
13        1     14    0.00650             9/11
14        1     15    0.02175          aardman
15        1     16    0.17975  aardman studios
16        1     17    0.01400         abortion
17        1     18    0.07575           absurd
18        1     19    0.65925           action
19        1     20    0.30250    action packed
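The merge above relies on each `tagId` appearing exactly once in the tag lookup table. In pandas, the `validate` argument of `merge` checks that assumption and raises `pandas.errors.MergeError` if it does not hold. The sketch below uses toy frames; the relevance value for movie 2 is made up.

```python
import pandas as pd

# Toy stand-ins for genome-scores.csv and genome-tags.csv.
scores_toy = pd.DataFrame({
    "movieId": [1, 1, 2],
    "tagId": [1, 2, 1],
    "relevance": [0.032, 0.02225, 0.5],
})
tags_lookup = pd.DataFrame({"tagId": [1, 2], "tag": ["007", "007 (series)"]})

# many_to_one: many score rows may share a tagId, but the lookup must be unique.
genome_toy = scores_toy.merge(
    tags_lookup, on="tagId", how="left", validate="many_to_one"
)
print(genome_toy)

# A lookup with a duplicated tagId fails the check.
dup_lookup = pd.concat([tags_lookup, tags_lookup.iloc[[0]]])
try:
    scores_toy.merge(dup_lookup, on="tagId", validate="many_to_one")
    raised = False
except pd.errors.MergeError:
    raised = True
print(raised)
```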


In [5]:
import dask.dataframe as dd

# Load the datasets
movies = dd.read_csv('Data/movies.csv')
new_ratings = dd.read_csv('Data/new_ratings.csv')
new_tags = dd.read_csv('Data/new_tags.csv')
new_genome = dd.read_csv('Data/new_genome.csv')

# Merge the movies dataset with the new_ratings dataset
# (default inner join: only movies with at least one rating are kept)
movies_ratings = dd.merge(movies, new_ratings, on='movieId')

# Merge the result with the new_tags dataset using a left join,
# so that rated movies without tags are kept
movies_ratings_tags = dd.merge(movies_ratings, new_tags, on='movieId', how='left')

# Compute the merged dataframe
movies_ratings_tags_computed = movies_ratings_tags.compute()

# Display the first few rows
print(movies_ratings_tags_computed.head())


   movieId                               title  \
0        1                    Toy Story (1995)   
1        2                      Jumanji (1995)   
2        3             Grumpier Old Men (1995)   
3        4            Waiting to Exhale (1995)   
4        5  Father of the Bride Part II (1995)   

                                        genres  Unnamed: 0  average_rating  \
0  Adventure|Animation|Children|Comedy|Fantasy           0        3.893508   
1                   Adventure|Children|Fantasy           1        3.278179   
2                               Comedy|Romance           2        3.171271   
3                         Comedy|Drama|Romance           3        2.868395   
4                                       Comedy           4        3.076957   

                                                tags  
0  animation, friendship, toys, Disney, Pixar, CG...  
1  animals, based on a book, fantasy, magic board...  
2  sequel, moldy, old, old age, old men, wedding,...  
3  charact
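The two merges in this cell use different join types, which affects which movies survive. The sketch below shows the difference between an inner and a left join on toy frames with hypothetical titles and ratings, and uses the `indicator` option of pandas `merge` to list movies that have no rating.

```python
import pandas as pd

# Toy frames: movie 3 has no ratings.
movies_toy = pd.DataFrame({"movieId": [1, 2, 3], "title": ["A", "B", "C"]})
ratings_avg = pd.DataFrame({"movieId": [1, 2], "average_rating": [3.9, 3.3]})

# Default inner join: movie 3 is dropped.
inner = movies_toy.merge(ratings_avg, on="movieId")

# Left join: movie 3 is kept, with a missing average_rating.
left = movies_toy.merge(ratings_avg, on="movieId", how="left", indicator=True)

# Rows marked 'left_only' exist in movies_toy but not in ratings_avg.
unrated = left.loc[left["_merge"] == "left_only", "movieId"].tolist()

print(len(inner), len(left), unrated)
```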

In [6]:
# Inspect the columns
movies_ratings_tags_computed.columns

# Remove the extra 'Unnamed: 0' column (the index that was saved with new_ratings.csv)
movies_ratings_tags_computed = movies_ratings_tags_computed.drop(columns=['Unnamed: 0'])

# Save the final dataset
movies_ratings_tags_computed.to_csv('Data/final_dataset.csv', index=False)
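An `Unnamed: 0` column typically appears when a CSV is written with its index and then read back. The sketch below reproduces this with pandas and an in-memory buffer, and shows that `index=False` avoids it. The frame and its values are hypothetical.

```python
import io
import pandas as pd

df = pd.DataFrame({"movieId": [1, 2], "average_rating": [3.9, 3.3]})

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
cols_with = list(pd.read_csv(with_index).columns)

# With index=False only the data columns are written.
no_index = io.StringIO()
df.to_csv(no_index, index=False)
no_index.seek(0)
cols_without = list(pd.read_csv(no_index).columns)

print(cols_with)
print(cols_without)
```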

In [7]:
import pandas as pd
# Displaying the final two datasets 

#Loading the final dataset 
final_dataset = pd.read_csv('Data/final_dataset.csv')

print("Final Dataset\n")
print("Columns: ",final_dataset.columns)
print(final_dataset.head(10))

# Loading the genome dataset
genome = pd.read_csv('Data/new_genome.csv')

print("Genome Dataset\n")
print("Columns: ",genome.columns)
print(genome.head(10))

Final Dataset

Columns:  Index(['movieId', 'title', 'genres', 'average_rating', 'tags'], dtype='object')
   movieId                               title  \
0        1                    Toy Story (1995)   
1        2                      Jumanji (1995)   
2        3             Grumpier Old Men (1995)   
3        4            Waiting to Exhale (1995)   
4        5  Father of the Bride Part II (1995)   
5        6                         Heat (1995)   
6        7                      Sabrina (1995)   
7        8                 Tom and Huck (1995)   
8        9                 Sudden Death (1995)   
9       10                    GoldenEye (1995)   

                                        genres  average_rating  \
0  Adventure|Animation|Children|Comedy|Fantasy        3.893508   
1                   Adventure|Children|Fantasy        3.278179   
2                               Comedy|Romance        3.171271   
3                         Comedy|Drama|Romance        2.868395   
4             