## Movies data

### Explore and clean the data
---

In [59]:
# Import statements
import pandas as pd

df = pd.read_csv("cost_revenue_dirty.csv")
df.sample(5)    # Checking a random sample of 5

Unnamed: 0,Rank,Release_Date,Movie_Title,USD_Production_Budget,USD_Worldwide_Gross,USD_Domestic_Gross
3934,286,7/29/2011,The Smurfs,"$110,000,000","$563,749,323","$142,614,158"
3202,3569,8/15/2008,Star Wars: The Clone Wars,"$8,500,000","$68,695,443","$35,161,554"
260,1676,12/14/1979,1941,"$32,000,000","$94,875,000","$34,175,000"
4181,1003,6/29/2012,Ted,"$50,000,000","$556,016,627","$218,665,740"
4454,5263,8/23/2013,Sparrows Dance,"$175,000","$2,602","$2,602"


#### Initial checks

In [79]:
# Initial checks
df.head()
df.shape
df.columns      # Column names look fine
df.dtypes       # Release_Date and the three USD columns are stored as strings and need converting

# Adjustments

# Convert the currency columns to numbers.
# regex=False makes the literal match explicit, because "$" is also a regex end-of-string anchor.
columns = ["USD_Production_Budget", "USD_Worldwide_Gross", "USD_Domestic_Gross"]
for column in columns:
    df[column] = df[column].astype(str).str.replace("$", "", regex=False)
    df[column] = df[column].astype(str).str.replace(",", "", regex=False)
    df[column] = pd.to_numeric(df[column])

# Convert the release date to datetime
df["Release_Date"] = pd.to_datetime(df["Release_Date"])
df.dtypes


Rank                              int64
Release_Date             datetime64[ns]
Movie_Title                      object
USD_Production_Budget             int64
USD_Worldwide_Gross               int64
USD_Domestic_Gross                int64
dtype: object
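
The currency cleanup can also be written as a small reusable helper. The sketch below is one possible way to do it, not the notebook's own method: the function name `currency_to_int` and the frame `sample` are illustrative, and the rows are copied from the random sample shown earlier. It uses a character class so that `$` is matched as a literal character rather than as a regex anchor.

```python
import pandas as pd

# Toy frame built from three rows of the sample output shown earlier
sample = pd.DataFrame({
    "Movie_Title": ["The Smurfs", "Ted", "Sparrows Dance"],
    "USD_Production_Budget": ["$110,000,000", "$50,000,000", "$175,000"],
    "USD_Worldwide_Gross": ["$563,749,323", "$556,016,627", "$2,602"],
})


def currency_to_int(series: pd.Series) -> pd.Series:
    """Strip '$' and ',' from currency strings and convert to numbers.

    Hypothetical helper: inside the character class [$,] the dollar sign
    is a literal character, so one regex pass removes both symbols.
    """
    cleaned = series.astype(str).str.replace(r"[$,]", "", regex=True)
    return pd.to_numeric(cleaned)


for col in ["USD_Production_Budget", "USD_Worldwide_Gross"]:
    sample[col] = currency_to_int(sample[col])

print(sample.dtypes)
```

Wrapping the conversion in a function keeps the loop body to one line and makes it easy to test on a few rows before applying it to the full data set.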

In [72]:
# Secondary checks

# Check for missing data
df.isna().sum()
# No missing data, so no changes needed

# Check for duplicates
# keep=False marks every copy of a duplicated row so they can be viewed together;
# keep="first" or keep="last" leaves one copy unmarked.
df[df[["Movie_Title", "Release_Date"]].duplicated(keep=False)].sort_values("Movie_Title")
# Initially checked Movie_Title alone, but those matches have different release dates.
# Matching on both title and release date leaves only one duplicate.
df[["Movie_Title", "Release_Date"]].duplicated().value_counts()
# Only one row is an exact repeat

df.drop_duplicates(subset=["Movie_Title", "Release_Date"], keep="first", inplace=True)
df[["Movie_Title", "Release_Date"]].duplicated().value_counts()


False    5390
Name: count, dtype: int64
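
To see how the `keep` argument changes what `duplicated` flags, here is a minimal sketch on a toy frame. The titles and dates are invented purely for illustration and are not taken from the data set.

```python
import pandas as pd

# Hypothetical rows: "A" appears twice with the same date and once with another date
toy = pd.DataFrame({
    "Movie_Title": ["A", "A", "A", "B"],
    "Release_Date": pd.to_datetime(
        ["2000-01-01", "2000-01-01", "2010-05-05", "2001-02-02"]
    ),
})

key = ["Movie_Title", "Release_Date"]

# keep=False flags every member of a duplicated group (useful for inspection)
all_copies = toy[toy.duplicated(subset=key, keep=False)]

# The default keep="first" flags only the later copies (the rows that would be dropped)
later_copies = toy[toy.duplicated(subset=key)]

# Dropping duplicates keeps the first copy of each title/date pair
deduped = toy.drop_duplicates(subset=key, keep="first").reset_index(drop=True)

print(len(all_copies), len(later_copies), len(deduped))
```

The third "A" row survives because its release date differs, which mirrors why the check uses both columns rather than the title alone.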

#### Descriptive stats

##### Challenge 1
What is the average production budget of the films in the data set?

What is the average worldwide gross revenue of films?

What were the minimums for worldwide and domestic revenue?

Are the bottom 25% of films actually profitable or do they lose money?

What are the highest production budget and highest worldwide gross revenue of any film?

How much revenue did the lowest and highest budget films make?

In [82]:
df.head()
pd.set_option('display.float_format', '{:,.2f}'.format)
df.describe()

Unnamed: 0,Rank,Release_Date,USD_Production_Budget,USD_Worldwide_Gross,USD_Domestic_Gross
count,5390.0,5390,5390.0,5390.0,5390.0
mean,2695.52,2003-09-19 08:04:21.818181888,31119487.81,31119487.81,31119487.81
min,1.0,1915-08-02 00:00:00,1100.0,1100.0,1100.0
25%,1348.25,1999-12-02 06:00:00,5000000.0,5000000.0,5000000.0
50%,2695.5,2006-06-23 00:00:00,17000000.0,17000000.0,17000000.0
75%,4042.75,2011-11-23 00:00:00,40000000.0,40000000.0,40000000.0
max,5391.0,2020-12-31 00:00:00,425000000.0,425000000.0,425000000.0
std,1556.14,,40525356.93,40525356.93,40525356.93
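
`describe()` covers the averages, minimums, and maximums, but the last Challenge 1 question needs a row lookup. The sketch below shows one possible approach on a five-row frame copied from the random sample shown earlier, so its numbers describe only those five films and not the full data set. The `Profit` column (worldwide gross minus production budget) and the reading of "bottom 25%" as the lowest quarter by budget are assumptions made for illustration.

```python
import pandas as pd

# Five films copied from the random sample shown earlier (already cleaned to integers)
films = pd.DataFrame({
    "Movie_Title": ["The Smurfs", "Star Wars: The Clone Wars", "1941", "Ted", "Sparrows Dance"],
    "USD_Production_Budget": [110_000_000, 8_500_000, 32_000_000, 50_000_000, 175_000],
    "USD_Worldwide_Gross": [563_749_323, 68_695_443, 94_875_000, 556_016_627, 2_602],
    "USD_Domestic_Gross": [142_614_158, 35_161_554, 34_175_000, 218_665_740, 2_602],
})

avg_budget = films["USD_Production_Budget"].mean()
min_revenue = films[["USD_Worldwide_Gross", "USD_Domestic_Gross"]].min()

# Assumed definition of profit: worldwide gross minus production budget
films["Profit"] = films["USD_Worldwide_Gross"] - films["USD_Production_Budget"]

# idxmin/idxmax return the row label of the extreme value, so .loc fetches the whole row
lowest_budget_film = films.loc[films["USD_Production_Budget"].idxmin()]
highest_budget_film = films.loc[films["USD_Production_Budget"].idxmax()]

# One reading of "bottom 25%": films at or below the 25th percentile of budget
q25 = films["USD_Production_Budget"].quantile(0.25)
bottom_quarter = films[films["USD_Production_Budget"] <= q25]

print(lowest_budget_film[["Movie_Title", "USD_Worldwide_Gross"]])
print(highest_budget_film[["Movie_Title", "USD_Worldwide_Gross"]])
print(bottom_quarter[["Movie_Title", "Profit"]])
```

Running the same lookups on the cleaned `df` instead of `films` would answer the questions for the full data set.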
