# Cleaning the Movies Dataset

Import the movies dataset, which was scraped from the [IMDb Top 250 Movies](https://www.imdb.com/chart/top/) page on October 30, 2022, at 10:58 pm EDT. Specify the expected data type for each column so that missing or non-numeric values in the numeric columns either stop the import or surface as NaN.

In [1]:
import pandas

In [2]:
movies = pandas.read_csv("movies.csv",
                         dtype={'rank': int,
                                'title': str,
                                'year': int,
                                'rating': float})

Take a brief look at the data.

In [3]:
movies

|     | rank | title | year | rating |
| --- | --- | --- | --- | --- |
| 0 | 1 | The Shawshank Redemption | 1994 | 9.2 |
| 1 | 2 | The Godfather | 1972 | 9.2 |
| 2 | 3 | The Dark Knight | 2008 | 9.0 |
| 3 | 4 | The Godfather Part II | 1974 | 9.0 |
| 4 | 5 | 12 Angry Men | 1957 | 9.0 |
| ... | ... | ... | ... | ... |
| 245 | 246 | Dersu Uzala | 1975 | 8.0 |
| 246 | 247 | The Help | 2011 | 8.0 |
| 247 | 248 | Gandhi | 1982 | 8.0 |
| 248 | 249 | Aladdin | 1992 | 8.0 |


## Check for Missing Data and Bad Values

Missing values are imported as NaN. A value that cannot be parsed as the declared type, such as text in a numeric column, raises an error during the import instead of becoming NaN. There is no need to check the *rank* and *year* columns for NaN values: they were imported as integers, and NaN cannot be represented as an integer, so any missing data in those columns would have caused the import to fail.

In [4]:
movies.dtypes

rank        int32
title      object
year        int32
rating    float64
dtype: object
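The import-failure behavior described above can be reproduced on a small scale. The sketch below is illustrative only: the two-row CSV text and the movie names in it are made up, and it assumes the default pandas parser engine.

```python
import io
import pandas

# Hypothetical CSV text (not from movies.csv): the second row has no year.
csv_text = (
    "rank,title,year,rating\n"
    "1,Movie A,1994,9.2\n"
    "2,Movie B,,9.0\n"
)

try:
    pandas.read_csv(io.StringIO(csv_text),
                    dtype={'rank': int,
                           'title': str,
                           'year': int,
                           'rating': float})
    import_failed = False
except ValueError as error:
    # An integer column cannot hold NaN, so the import stops here.
    import_failed = True
    print(error)
```

Because the real import in cell 2 completed without an error, the *rank* and *year* columns cannot contain missing values.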

The *title* and *rating* columns could contain NaN values because they were imported as string (shown as `object`) and float data, respectively. Count the number of NaN values in each column.

In [5]:
movies.isnull().sum()

rank      0
title     0
year      0
rating    0
dtype: int64

Good, there were no NaN values in any column.  
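Had any count been nonzero, the next step would be to locate the affected rows. A minimal sketch of one way to do that, using a small made-up frame (`sample` and its values are hypothetical, not taken from the dataset):

```python
import pandas

# Hypothetical data with one missing title and one missing rating.
sample = pandas.DataFrame({'title': ['Movie A', None, 'Movie C'],
                           'rating': [9.2, 9.0, float('nan')]})

# Keep only the rows in which at least one column is null.
rows_with_nan = sample[sample.isnull().any(axis=1)]
print(rows_with_nan)
```

Applied to `movies`, the same expression would return an empty DataFrame, given the zero counts above.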

## Check the *rank* Data

Check that the movies were ranked from 1 to 250.

In [6]:
(movies['rank'] == pandas.Series(range(1, 251))).all()

True
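One caveat with the comparison above: pandas refuses to compare two Series whose index labels differ, so the expression raises an error rather than returning False if the dataset does not have exactly 250 rows. A possible alternative, sketched here with a hypothetical helper named `ranks_are_sequential`, compares plain Python lists instead:

```python
import pandas

def ranks_are_sequential(ranks, expected_count):
    """Return True if ranks are exactly 1, 2, ..., expected_count, in order."""
    return ranks.tolist() == list(range(1, expected_count + 1))

# Hypothetical rank columns: one complete, one missing its last entry.
complete = pandas.Series([1, 2, 3, 4, 5])
short = pandas.Series([1, 2, 3, 4])

complete_ok = ranks_are_sequential(complete, 5)
short_ok = ranks_are_sequential(short, 5)
print(complete_ok, short_ok)
```

For the movies data, the equivalent call would be `ranks_are_sequential(movies['rank'], 250)`.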

## Check the *year* Data

Check that *year* values were in the range from 1874 to 2022, because the oldest movie on IMDb was released in 1874, and *year* data cannot be in the future.

In [7]:
movies['year'].min()

1921

In [8]:
movies['year'].max()

2022

In [9]:
movies['year'].min() >= 1874 and movies['year'].max() <= 2022

True
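The minimum and maximum confirm that the range holds, but they would not show which rows are at fault if it did not. As a sketch of an alternative, `Series.between` (inclusive on both ends by default) tests every value and makes the offending rows easy to list. The `years` values below are made up for illustration:

```python
import pandas

# Hypothetical year values; 2031 lies outside the allowed range.
years = pandas.Series([1921, 1994, 2022, 2031])

# True for each value in the closed interval [1874, 2022].
in_range = years.between(1874, 2022)
out_of_range = years[~in_range]

print(in_range.all())
print(out_of_range)
```

The same pattern would apply to the *rating* column with bounds of 1.0 and 10.0.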

## Check the *rating* Data

Check that *rating* values were in the range from 1 to 10, because every IMDb rating must be in that range.

In [10]:
movies['rating'].min()

8.0

In [11]:
movies['rating'].max()

9.2

In [12]:
movies['rating'].min() >= 1.0 and movies['rating'].max() <= 10.0

True

## Check for Duplicate Movies

This check assumes that no two movies in the list share both a title and a release year; a title on its own can legitimately repeat, for example when a movie is remade. Combine the *title* and *year* columns and check for duplicate entries.

In [13]:
title_with_year = movies['title'] + " (" + movies['year'].astype("string") + ")"
title_with_year

0      The Shawshank Redemption (1994)
1                 The Godfather (1972)
2               The Dark Knight (2008)
3         The Godfather Part II (1974)
4                  12 Angry Men (1957)
                    ...               
245                 Dersu Uzala (1975)
246                    The Help (2011)
247                      Gandhi (1982)
248                     Aladdin (1992)
249              The Iron Giant (1999)
Length: 250, dtype: string

In [14]:
title_with_year.duplicated().any()

False

Good, every movie was unique.
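The same check can be written without building a combined string, because `DataFrame.duplicated` accepts a `subset` of columns. A minimal sketch on a made-up frame (the titles and years are hypothetical):

```python
import pandas

# Hypothetical data: rows 0 and 2 share both title and year.
sample = pandas.DataFrame({'title': ['Movie A', 'Movie A', 'Movie A'],
                           'year': [1990, 2005, 1990]})

# keep=False marks every member of a duplicate group, not only the repeats.
is_duplicate = sample.duplicated(subset=['title', 'year'], keep=False)
duplicate_rows = sample[is_duplicate]
print(duplicate_rows)
```

On the movies data, `movies.duplicated(subset=['title', 'year']).any()` should agree with the result of cell 14.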

# Descriptive Statistics for Numeric Movie Data

In [15]:
movies.describe().round(2)

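|       | rank | year | rating |
| --- | --- | --- | --- |
| count | 250.0 | 250.0 | 250.0 |
| mean | 125.5 | 1986.49 | 8.25 |
| std | 72.31 | 25.23 | 0.24 |
| min | 1.0 | 1921.0 | 8.0 |
| 25% | 63.25 | 1966.25 | 8.1 |
| 50% | 125.5 | 1994.0 | 8.2 |
| 75% | 187.75 | 2006.75 | 8.4 |
| max | 250.0 | 2022.0 | 9.2 |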


Note that **std** is the standard deviation, and **25%**, **50%**, and **75%** are the first quartile, the median, and the third quartile.
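To see where the percentile rows come from, the sketch below recomputes them for a rank column with `Series.quantile`. It rebuilds the ranks 1 to 250 directly instead of reading `movies.csv`, which is an assumption justified by the rank check in cell 6.

```python
import pandas

# Ranks 1..250, rebuilt here so the example runs on its own.
ranks = pandas.Series(range(1, 251))

# The 25%, 50%, and 75% rows of describe() are these three quantiles.
quartiles = ranks.quantile([0.25, 0.5, 0.75])
summary = ranks.describe()

print(quartiles)
print(summary[['25%', '50%', '75%']])
```

The values match the *rank* column of the table: 63.25, 125.5, and 187.75.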

# Descriptive Statistics for String Movie Data

In [16]:
movies.describe(include=['O'])

|        | title |
| --- | --- |
| count | 250 |
| unique | 250 |
| top | The Shawshank Redemption |
| freq | 1 |


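Note that **top** is the most frequent *title* and **freq** is the number of times that *title* occurs. Because **freq** is 1, no title occurs more often than any other, and **top** carries no special meaning here.

The results showed that all of the *title* entries were unique, regardless of the *year* of release: **unique** equals **count**.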
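For contrast, the sketch below shows how these statistics look when a title does repeat. The `titles` values are made up, and `Series.is_unique` is offered as a more direct uniqueness test than reading the summary table.

```python
import pandas

# Hypothetical titles in which 'Movie A' appears twice.
titles = pandas.Series(['Movie A', 'Movie B', 'Movie A'])

summary = titles.describe()
print(summary)

# False here because one title repeats.
print(titles.is_unique)
```

For the movies data, `movies['title'].is_unique` should be True, consistent with the table above.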