# Movie Ratings Lab (Text Manipulation)

### Intro and objectives
#### Apply the concepts learned so far in a real use case

### In this lab you will learn:
1. Advanced processing of text

## What I hope you'll get out of this lab
* Gain experience processing text from real datasets

In [4]:
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt

### GroupLens Research provides a number of collections of movie ratings data collected from users of MovieLens in the late 1990s and early 2000s. The data provides movie ratings, movie metadata (genres and year), and demographic data about the users (age, zip code, gender identification, and occupation). Such data is often of interest in the development of recommendation systems based on machine learning algorithms.

### While this lab does not explore machine learning techniques in detail, you will learn how to slice and dice datasets like these into the exact form you need.

### The MovieLens 1M dataset is spread across three tables: ratings, user information, and movie information. These files contain 1,000,209 anonymous ratings of approximately 3,900 movies made by 6,040 MovieLens users who joined MovieLens in 2000.


#### RATINGS FILE DESCRIPTION
================================================================================

All ratings are contained in the file "ratings.dat" and are in the
following format:

UserID::MovieID::Rating::Timestamp

- UserIDs range between 1 and 6040
- MovieIDs range between 1 and 3952
- Ratings are made on a 5-star scale (whole-star ratings only)
- Timestamp is represented in seconds since the epoch as returned by time(2)
- Each user has at least 20 ratings
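
Because the timestamp counts seconds since the Unix epoch, it can be turned into a readable date with `pd.to_datetime`. The following is a minimal sketch on three sample values from the ratings data used in this lab; the names `rated_at` and `rating_year` are illustrative and not part of the dataset.

```python
import pandas as pd

# Three raw timestamps as they appear in the ratings table.
timestamps = pd.Series([978300760, 978302109, 978824291])

# unit="s" tells pandas the integers count seconds since 1970-01-01 (UTC).
rated_at = pd.to_datetime(timestamps, unit="s")

# The .dt accessor exposes date parts such as the year of each rating.
rating_year = rated_at.dt.year

print(rated_at)
print(rating_year)
```

The same call could be applied to the whole `timestamp` column once the ratings are loaded.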

#### USERS FILE DESCRIPTION
================================================================================

User information is in the file "users.dat" and is in the following
format:

UserID::Gender::Age::Occupation::Zip-code

All demographic information is provided voluntarily by the users and is
not checked for accuracy.  Only users who have provided some demographic
information are included in this data set.

- Gender is denoted by a "M" for male and "F" for female
- Age is chosen from the following ranges:

	*  1:  "Under 18"
	* 18:  "18-24"
	* 25:  "25-34"
	* 35:  "35-44"
	* 45:  "45-49"
	* 50:  "50-55"
	* 56:  "56+"

- Occupation is chosen from the following choices:

	*  0:  "other" or not specified
	*  1:  "academic/educator"
	*  2:  "artist"
	*  3:  "clerical/admin"
	*  4:  "college/grad student"
	*  5:  "customer service"
	*  6:  "doctor/health care"
	*  7:  "executive/managerial"
	*  8:  "farmer"
	*  9:  "homemaker"
	* 10:  "K-12 student"
	* 11:  "lawyer"
	* 12:  "programmer"
	* 13:  "retired"
	* 14:  "sales/marketing"
	* 15:  "scientist"
	* 16:  "self-employed"
	* 17:  "technician/engineer"
	* 18:  "tradesman/craftsman"
	* 19:  "unemployed"
	* 20:  "writer"
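
The age and occupation columns store numeric codes, so a lookup dictionary plus `Series.map` is one way to attach the readable labels. This is a sketch on a small hand-built frame; `age_labels`, `occupation_labels`, and `sample_users` are hypothetical names, and the occupation dictionary is deliberately partial.

```python
import pandas as pd

# Age codes and labels copied from the list above.
age_labels = {
    1: "Under 18", 18: "18-24", 25: "25-34", 35: "35-44",
    45: "45-49", 50: "50-55", 56: "56+",
}

# Only a few occupation codes are included here; extend as needed.
occupation_labels = {
    7: "executive/managerial",
    10: "K-12 student",
    15: "scientist",
    16: "self-employed",
    20: "writer",
}

sample_users = pd.DataFrame({
    "user_id": [1, 2, 3, 4, 5],
    "age": [1, 56, 25, 45, 25],
    "occupation": [10, 16, 15, 7, 20],
})

# map() replaces each code with its label; codes missing from the dict become NaN.
sample_users["age_group"] = sample_users["age"].map(age_labels)
sample_users["occupation_name"] = sample_users["occupation"].map(occupation_labels)

print(sample_users)
```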

#### MOVIES FILE DESCRIPTION
================================================================================

Movie information is in the file "movies.dat" and is in the following
format:

MovieID::Title::Genres

- Titles are identical to titles provided by the IMDB (including
year of release)
- Genres are pipe-separated and are selected from the following genres:

	* Action
	* Adventure
	* Animation
	* Children's
	* Comedy
	* Crime
	* Documentary
	* Drama
	* Fantasy
	* Film-Noir
	* Horror
	* Musical
	* Mystery
	* Romance
	* Sci-Fi
	* Thriller
	* War
	* Western

- Some MovieIDs do not correspond to a movie due to accidental duplicate
entries and/or test entries
- Movies are mostly entered by hand, so errors and inconsistencies may exist

In [5]:
unames = ["user_id", "gender", "age", "occupation", "zip"]
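# The two-character "::" separator is not supported by the default C parser, so request the Python engine explicitly.
users = pd.read_table("https://raw.githubusercontent.com/thousandoaks/Python4DS-II/main/datasets/users2.txt", sep="::", header=None, names=unames, engine="python")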








In [6]:
rnames = ["user_id", "movie_id", "rating", "timestamp"]
ratings = pd.read_table("https://raw.githubusercontent.com/thousandoaks/Python4DS-II/main/datasets/ratings2.txt", sep="::", header=None, names=rnames, engine="python")





In [7]:
mnames = ["movie_id", "title", "genres"]
movies = pd.read_table("https://raw.githubusercontent.com/thousandoaks/Python4DS-II/main/datasets/movies2.txt", sep="::", header=None, names=mnames, engine="python")

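
The sketch below reads two lines in the `users.dat` layout from an in-memory string, to show how the `::` separator can be handled. A separator longer than one character is treated as a regular expression and is only supported by the Python parsing engine, so `engine="python"` is passed explicitly. The second zip code is a hypothetical value with a leading zero, included to show that `dtype={"zip": str}` keeps it intact.

```python
import io

import pandas as pd

# Two rows in the UserID::Gender::Age::Occupation::Zip-code layout.
# The zip code "02460" is a made-up illustration of a leading zero.
raw = "1::F::1::10::48067\n4::M::45::7::02460\n"

unames = ["user_id", "gender", "age", "occupation", "zip"]

sample = pd.read_csv(
    io.StringIO(raw),
    sep="::",            # multi-character separator
    header=None,         # the file has no header row
    names=unames,
    engine="python",     # the C engine does not support this separator
    dtype={"zip": str},  # read zip codes as text, not integers
)

print(sample)
print(sample.dtypes)
```

Reading zip codes as text is a design choice: they are identifiers rather than quantities, and integer parsing would drop any leading zeros.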


In [8]:
users.head()

Unnamed: 0,user_id,gender,age,occupation,zip
0,1,F,1,10,48067
1,2,M,56,16,70072
2,3,M,25,15,55117
3,4,M,45,7,2460
4,5,M,25,20,55455


In [9]:
ratings.head()

Unnamed: 0,user_id,movie_id,rating,timestamp
0,1,1193,5,978300760
1,1,661,3,978302109
2,1,914,3,978301968
3,1,3408,4,978300275
4,1,2355,5,978824291


In [10]:
movies.head()

Unnamed: 0,movie_id,title,genres
0,1,Toy Story (1995),Animation|Children's|Comedy
1,2,Jumanji (1995),Adventure|Children's|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama
4,5,Father of the Bride Part II (1995),Comedy


#### Analyzing data spread across three tables is awkward. Suppose you wanted to compute the mean rating of a particular movie by gender and age: that is much easier once all of the data is merged into a single table. Using pandas's merge function, we first merge ratings with users and then merge that result with the movies data. pandas can infer the merge (or join) keys from overlapping column names; the cells below pass them explicitly with the on argument:

In [11]:
RatingsUsersDataFrame = pd.merge(ratings, users, on='user_id')

In [12]:
RatingsUsersDataFrame

Unnamed: 0,user_id,movie_id,rating,timestamp,gender,age,occupation,zip
0,1,1193,5,978300760,F,1,10,48067
1,1,661,3,978302109,F,1,10,48067
2,1,914,3,978301968,F,1,10,48067
3,1,3408,4,978300275,F,1,10,48067
4,1,2355,5,978824291,F,1,10,48067
...,...,...,...,...,...,...,...,...
1000204,6040,1091,1,956716541,M,25,6,11106
1000205,6040,1094,5,956704887,M,25,6,11106
1000206,6040,562,5,956704746,M,25,6,11106
1000207,6040,1096,4,956715648,M,25,6,11106
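
To illustrate how merge keys are chosen, the sketch below merges two toy frames twice: once with the key given explicitly and once letting pandas infer it from the only shared column name. The frames and their values are hypothetical. The `validate="many_to_one"` argument is an optional check that each `user_id` appears only once in the users table.

```python
import pandas as pd

# Toy tables; the values are invented for illustration.
ratings_small = pd.DataFrame({
    "user_id": [1, 1, 2],
    "movie_id": [1193, 661, 1193],
    "rating": [5, 3, 4],
})
users_small = pd.DataFrame({
    "user_id": [1, 2],
    "gender": ["F", "M"],
    "age": [1, 56],
})

# Explicit key, plus a check that users_small has one row per user_id.
explicit = pd.merge(ratings_small, users_small, on="user_id", validate="many_to_one")

# No key given: pandas joins on the overlapping column name, user_id.
inferred = pd.merge(ratings_small, users_small)

same = explicit.equals(inferred)
print(explicit)
print("Same result:", same)
```

Passing `on=` explicitly is usually the safer habit, since an unexpected shared column name would otherwise silently become part of the join key.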


In [13]:
RatingsUsersDataFrameMovies = pd.merge(RatingsUsersDataFrame, movies, on='movie_id')

In [14]:
RatingsUsersDataFrameMovies

Unnamed: 0,user_id,movie_id,rating,timestamp,gender,age,occupation,zip,title,genres
0,1,1193,5,978300760,F,1,10,48067,One Flew Over the Cuckoo's Nest (1975),Drama
1,1,661,3,978302109,F,1,10,48067,James and the Giant Peach (1996),Animation|Children's|Musical
2,1,914,3,978301968,F,1,10,48067,My Fair Lady (1964),Musical|Romance
3,1,3408,4,978300275,F,1,10,48067,Erin Brockovich (2000),Drama
4,1,2355,5,978824291,F,1,10,48067,"Bug's Life, A (1998)",Animation|Children's|Comedy
...,...,...,...,...,...,...,...,...,...,...
1000204,6040,1091,1,956716541,M,25,6,11106,Weekend at Bernie's (1989),Comedy
1000205,6040,1094,5,956704887,M,25,6,11106,"Crying Game, The (1992)",Drama|Romance|War
1000206,6040,562,5,956704746,M,25,6,11106,Welcome to the Dollhouse (1995),Comedy|Drama
1000207,6040,1096,4,956715648,M,25,6,11106,Sophie's Choice (1982),Drama
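
With everything in one table, the mean rating of a movie by gender and age reduces to a single `groupby` or `pivot_table` call. The sketch below uses a tiny hypothetical frame (`merged_small`) with the same column names as the merged table; the ratings in it are invented.

```python
import pandas as pd

# A tiny stand-in for the merged table; the ratings are made up.
merged_small = pd.DataFrame({
    "title": ["Toy Story (1995)"] * 4 + ["Jumanji (1995)"] * 2,
    "gender": ["F", "F", "M", "M", "F", "M"],
    "age": [25, 25, 25, 35, 25, 25],
    "rating": [5, 4, 3, 4, 2, 4],
})

# Mean rating for every (title, gender, age) combination that occurs.
mean_by_group = merged_small.groupby(["title", "gender", "age"])["rating"].mean()

# The same idea as a wide table: one row per title, one column per gender.
mean_by_gender = merged_small.pivot_table(
    values="rating", index="title", columns="gender", aggfunc="mean"
)

print(mean_by_group)
print(mean_by_gender)
```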


## Let's split the genres of every rated movie into separate columns

In [36]:
RatingsUsersDataFrameMovies['genres'].str.split(pat='|',expand=True)

Unnamed: 0,0,1,2,3,4,5
0,Drama,,,,,
1,Animation,Children's,Musical,,,
2,Musical,Romance,,,,
3,Drama,,,,,
4,Animation,Children's,Comedy,,,
...,...,...,...,...,...,...
1000204,Comedy,,,,,
1000205,Drama,Romance,War,,,
1000206,Comedy,Drama,,,,
1000207,Drama,,,,,
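
The wide layout above leaves many empty cells. As an alternative sketch, the genres can be split into lists and reshaped with `explode` into one row per (movie, genre) pair, or turned into 0/1 indicator columns with `str.get_dummies`. The frame `movies_small` is a hypothetical three-row sample using titles and genres from the movies table.

```python
import pandas as pd

movies_small = pd.DataFrame({
    "title": ["Toy Story (1995)", "Jumanji (1995)", "Grumpier Old Men (1995)"],
    "genres": [
        "Animation|Children's|Comedy",
        "Adventure|Children's|Fantasy",
        "Comedy|Romance",
    ],
})

# Without expand=True, split() returns one list of genres per row.
genre_lists = movies_small["genres"].str.split("|")

# explode() turns each list element into its own row.
long_form = movies_small.assign(genre=genre_lists).explode("genre")

# Count how many movies carry each genre.
genre_counts = long_form["genre"].value_counts()

# One 0/1 column per genre, aligned with the original rows.
indicators = movies_small["genres"].str.get_dummies(sep="|")

print(long_form[["title", "genre"]])
print(genre_counts)
print(indicators)
```

The long form is convenient for grouping by genre, while the indicator columns are convenient for filtering.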


## Let's extract the release year (regex based)
### For this we rely on Google Gemini to help us find a regular expression that captures the four-digit year embedded in the title column.
### The prompt you need to submit is similar to the following: "define a regex expression to extract the 4 digit number from the title column in the dataframe: RatingsUsersDataFrameMovies"

In [18]:
# prompt: define a regex expression to extract the 4 digit number from the title column in the dataframe: RatingsUsersDataFrameMovies

import re

def extract_year(title):
    """Return the four-digit year found in parentheses as an int, or None if there is none."""
    match = re.search(r'\((\d{4})\)', title)
    if match:
        return int(match.group(1))
    return None

RatingsUsersDataFrameMovies['year'] = RatingsUsersDataFrameMovies['title'].apply(extract_year)
RatingsUsersDataFrameMovies.head()

Unnamed: 0,user_id,movie_id,rating,timestamp,gender,age,occupation,zip,title,genres,year
0,1,1193,5,978300760,F,1,10,48067,One Flew Over the Cuckoo's Nest (1975),Drama,1975
1,1,661,3,978302109,F,1,10,48067,James and the Giant Peach (1996),Animation|Children's|Musical,1996
2,1,914,3,978301968,F,1,10,48067,My Fair Lady (1964),Musical|Romance,1964
3,1,3408,4,978300275,F,1,10,48067,Erin Brockovich (2000),Drama,2000
4,1,2355,5,978824291,F,1,10,48067,"Bug's Life, A (1998)",Animation|Children's|Comedy,1998
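
The same extraction can also be sketched without a Python-level function by using the vectorized `str.extract`. The pattern here is a slight variation that anchors the year to the end of the title; that anchoring is an assumption about the title format, not something the dataset guarantees. The last title is hypothetical and shows what happens when no year is present.

```python
import pandas as pd

titles = pd.Series([
    "Toy Story (1995)",
    "2001: A Space Odyssey (1968)",
    "Man Bites Dog (C'est arrivé près de chez vous) (1992)",
    "Untitled entry",  # hypothetical title without a year
])

# Capture four digits inside parentheses at the end of the string.
# expand=False returns a Series because there is a single capture group.
year_text = titles.str.extract(r"\((\d{4})\)\s*$", expand=False)

# Convert to a nullable integer type so missing years stay missing.
years = pd.to_numeric(year_text, errors="coerce").astype("Int64")

print(years)
```

Requiring the parentheses means the leading "2001" in the second title is not mistaken for the release year.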


## Let's display the title, rating, and user ID using f-strings

In [30]:
pd.options.display.max_colwidth = 500
RatingsUsersDataFrameMovies[['title','rating','user_id']].sample(10).apply(lambda x:f" The movie: {x['title']} got a rating of: {x['rating']} from user: {x['user_id']} ",axis=1)

Unnamed: 0,0
461755,"The movie: Killing Fields, The (1984) got a rating of: 3 from user: 2848"
133279,The movie: Night of the Living Dead (1968) got a rating of: 4 from user: 860
472152,The movie: Star Trek: The Wrath of Khan (1982) got a rating of: 4 from user: 2906
22985,"The movie: Elephant Man, The (1980) got a rating of: 5 from user: 166"
373644,"The movie: Blues Brothers, The (1980) got a rating of: 2 from user: 2180"
778050,The movie: Glory (1989) got a rating of: 3 from user: 4647
404648,"The movie: South Park: Bigger, Longer and Uncut (1999) got a rating of: 3 from user: 2419"
412643,The movie: 2001: A Space Odyssey (1968) got a rating of: 1 from user: 2484
224058,The movie: Man Bites Dog (C'est arrivé près de chez vous) (1992) got a rating of: 5 from user: 1356
574356,"The movie: City of Lost Children, The (1995) got a rating of: 4 from user: 3516"
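
As a variation, the same sentences can be built with a list comprehension over the columns instead of a row-wise `apply`. This is only a sketch on a two-row frame built from rows in the output above; `sample` and `messages` are illustrative names.

```python
import pandas as pd

sample = pd.DataFrame({
    "title": ["Glory (1989)", "Killing Fields, The (1984)"],
    "rating": [3, 3],
    "user_id": [4647, 2848],
})

# zip() walks the three columns in parallel; the f-string fills in each row.
messages = [
    f"The movie: {title} got a rating of: {rating} from user: {user_id}"
    for title, rating, user_id in zip(sample["title"], sample["rating"], sample["user_id"])
]

for message in messages:
    print(message)
```

On large frames this tends to be faster than `apply` with `axis=1`, since it avoids building a Series object for every row.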
