# Assignment 01 (100 points)
### Learning Outcomes 
By completing this assignment, you will gain and demonstrate skills relating to
- Working within the python and Jupyter notebook environment 
- Learning the basics of pandas’ dataframes 
- Extracting descriptive statistics from the data
- Plotting basic histograms and judging the shape of different distributions

### Problem Description
As a data scientist for a Hollywood investment company, you have been charged with creating a summary of the last 250 movies so that your boss can gain a deeper understanding of the movie industry. The dataset that you will be working with includes 8 variables:
- North American box office revenue (boxoff)
- Production cost of the movie (prodcost)
- Income of the director (dirIncome)
- Gender of the director (dirGender) 
- Year of release (year)
- Month of release (month) 
- Movie genre (genre)
- Number of theatres (numTheatres)

DISCLAIMER: The data set is made up and different from last year!

### General Instructions
Unless otherwise stated,
- You are encouraged to use numpy, pandas, and matplotlib methods (e.g., mean, etc.).
- Results in written answers should be rounded to 3 decimal places. 
- Before starting this assignment, it is important you go through the Installation and Assignment Instructions document.

### Learning about pandas and data frames 
In this homework, you will be expected to attend the lab and also read through portions of the “Python for Data Analysis” textbook. Make sure that you familiarize yourself with Numpy (Chapter 4). Then work through the relevant sections in Chapters 5, 6, 8 and 9 to learn the basic usage of data frames. This is a lot to read, and not everything is equally important. Nonetheless, I encourage you to start working through the essentials. To help you not get lost, we will point you to the relevant book section for the various tasks. However, you may have to go back and read some basics if you realize that you are missing a foundational concept.

### Submit via OWL
Please use this Jupyter notebook to fill in the answers below. Before submitting, make sure you clear the output (Cell Menu -> All Outputs -> Clear). Submit your notebook file under the file name YourUserID_Assign1.ipynb

Make sure you attach the file to your assignment; DO NOT put the code inline in the textbox.
Make sure that you develop your code with Python 3.7 as the interpreter. The TA will not endeavor to fix code that uses earlier versions of Python (e.g. 2.7). Make sure that your code includes all statements that it requires to work properly when calling Cell->Run All. 

### General Marking criteria: 
- Working code in Python 3.7 
- Written answer in full English sentences, if required 
- For more complicated questions, show how you arrived at your answer
- Informative variable names 
- For larger pieces of code, proper comments and explanations 
- All figures require axis labels + a caption 

## Task1: load the data into dataframe (? \ 5 points)

### Question1. import pandas as pd (pg.13) (? \1 points)

In [None]:
import pandas as pd

### Question2. load the data file as a dataframe object (pg. 155 – 159) (? \ 4 points)

In [None]:
# load the data file
df = pd.read_csv('movieDataset.csv')

# show dataframe briefly
display(df.head())

## Task2: Understanding Pandas dataframe structure (? \ 25 points)

### Question1. learning how to access column names (pg. 116) (? \ 2 points)
Print the names of the columns in the data frame you loaded

In [None]:
print(df.columns)

### Question2. Retrieving a column (pg. 116) (? \ 3 points)
Print the data in the dirIncome column to the screen

In [None]:
print(df['dirIncome'])

### Question3. Retrieving a data entry (i.e. a row) (pg. 117) (? \ 5 points)
Print the data in the 15th row

In [None]:
# To get the 15th row, use iloc[14] since row numbers start from 0
print(df.iloc[14])

### Question4. Retrieving multiple rows (pg. 125 – 128) (? \ 4 points )
Print the data in rows 99 to 104

In [None]:
# Rows 99 to 104 are positions 98 to 103; the end of an iloc slice is exclusive
print(df.iloc[98:104])
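The difference between position-based and label-based slicing is easy to get wrong, so here is a minimal sketch of it. The `toy` data frame and its values are hypothetical stand-ins, not the assignment data.

```python
import pandas as pd

# Hypothetical toy data standing in for the movie dataset
toy = pd.DataFrame({'dirIncome': [1.2, 0.8, 3.4, 2.1, 5.6, 0.9]})

# iloc is position-based and excludes the end position:
# positions 1, 2 and 3, i.e. the 2nd to 4th rows
by_position = toy.iloc[1:4]

# loc is label-based and includes the end label:
# labels 1, 2, 3 and 4
by_label = toy.loc[1:4]

print(len(by_position), len(by_label))
```

With the default integer index, the two calls look almost identical but return a different number of rows, which is why the solution above uses `iloc` with an end position one past the last wanted row.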

### Question5. Retrieve the rows that meet a specific condition (pg. 126 – 127) (? \ 7 points)
Print the data entries where the director’s income is greater than or equal to 5 million dollars

In [None]:
# dirIncome is assumed to be recorded in millions of dollars, so 5 means 5 million
df.loc[df['dirIncome'] >= 5]
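A condition such as the one above produces a Boolean Series, and several of these can be combined. The sketch below assumes a hypothetical `toy` data frame with made-up genre values; it is only meant to illustrate the `&` and `|` operators.

```python
import pandas as pd

# Hypothetical toy data; the genre values are made up for illustration
toy = pd.DataFrame({
    'dirIncome': [1.2, 5.0, 3.4, 6.1],
    'genre': ['drama', 'action', 'action', 'comedy'],
})

# Each comparison yields a Boolean Series with one entry per row
high_income = toy['dirIncome'] >= 5
is_action = toy['genre'] == 'action'

# Combine masks element-wise with & (and) or | (or); Python's `and`/`or` do not work here
both = toy.loc[high_income & is_action]
either = toy.loc[high_income | is_action]

print(both)
print(either)
```

When the comparisons are written inline instead of stored in variables, each one needs its own parentheses, for example `toy.loc[(toy['dirIncome'] >= 5) & (toy['genre'] == 'action')]`.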

### Question6. creating a new column (pg. 128 – 129 and pg. 261) (? \ 4 points)
Add a new column to the dataframe (profit). This column should be the total box office income minus the production cost.

In [None]:
df['profit'] = df['boxoff'] - df['prodcost']

#display it 
display(df.head())

## Task3: Descriptive statistics (? \ 40 points)

### Question1. For the director Income column, print the median, mean, minimal value and maximal value (? \ 16 points)

#### median for dirIncome column (? \ 4 points)

In [None]:
# get median for dirIncome column
med = df['dirIncome'].median()

# print median
print(f"The median for dirIncome column is {med}")

# round to three decimals and then print it
print("After rounding to 3 decimal places, it is", round(med, 3))

#### mean for dirIncome column (? \ 4 points)

In [None]:
# get mean for dirIncome column
mean_income = df['dirIncome'].mean()

# print mean
print(f"The mean for dirIncome column is {mean_income}")

# round to three decimals and then print it
print("After rounding to 3 decimal places, it is", round(mean_income, 3))

#### minimum for dirIncome column (? \ 4 points)

In [None]:
# get minimum for dirIncome column
# (named min_income so the built-in min() is not overwritten)
min_income = df['dirIncome'].min()

# print minimum
print(f"The minimum for dirIncome column is {min_income}")

# round to three decimals and then print it
print("After rounding to 3 decimal places, it is", round(min_income, 3))

#### maximum for dirIncome column (? \ 4 points)

In [None]:
# get maximum for dirIncome column
# (named max_income so the built-in max() is not overwritten)
max_income = df['dirIncome'].max()

# print maximum
print(f"The maximum for dirIncome column is {max_income}")

# round to three decimals and then print it
print("After rounding to 3 decimal places, it is", round(max_income, 3))
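The four statistics can also be computed in one call. The following is a sketch on a hypothetical Series (`toy`, with made-up values), using the `agg` method with a list of statistic names.

```python
import pandas as pd

# Hypothetical income values, for illustration only
toy = pd.Series([0.5, 1.0, 2.0, 2.5, 9.0], name='dirIncome')

# agg with a list of names returns one value per statistic
summary = toy.agg(['median', 'mean', 'min', 'max']).round(3)

print(summary)
```

The result is a Series indexed by the statistic names, so a single value can be read with, for example, `summary['median']`.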

### Question2. Written answer: From these values, what can you guess about the Skewness of the distribution? Justify your answer (? \ 8 points)

The median of the dirIncome column is about 1.951 and the mean is about 2.128.

Because the mean is larger than the median, the distribution is likely right-skewed: a small number of high incomes pull the mean upward, while the median is largely unaffected by them.
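The mean-versus-median comparison is a heuristic, and it can be cross-checked against the sample skewness. Below is a minimal sketch on hypothetical values (not the assignment data), using the pandas `skew` method.

```python
import pandas as pd

# Hypothetical values with one large observation in the right tail
toy = pd.Series([1, 1, 2, 2, 3, 10])

mean_value = toy.mean()
median_value = toy.median()

# Sample skewness: positive values indicate a longer right tail
skewness = toy.skew()

print(f"mean={mean_value:.3f}, median={median_value:.3f}, skewness={skewness:.3f}")
```

In this toy example both indicators agree: the mean exceeds the median and the skewness is positive.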

### Question3. Print out the data for the movie that has the highest box-office income (? \ 16 points)
(Hint: you may first have to determine the maximum value in the box-office column and then use retrieval strategies to print out all the data)

In [None]:
# get the highest box-office income first
# (named max_boxoff so the built-in max() is not overwritten)
max_boxoff = df['boxoff'].max()

# display the answer
display(df.loc[df['boxoff'] == max_boxoff])
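An alternative to the two-step retrieval is `idxmax`, which returns the index label of the largest value directly. This sketch uses a hypothetical `toy` data frame with made-up values.

```python
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    'boxoff': [120.0, 310.5, 95.2],
    'genre': ['drama', 'action', 'comedy'],
})

# Index label of the row with the largest box-office value
top_label = toy['boxoff'].idxmax()

# A single label returns the row as a Series;
# a list of labels returns a one-row DataFrame
top_row = toy.loc[top_label]
top_row_frame = toy.loc[[top_label]]

print(top_row_frame)
```

One design difference to keep in mind: `idxmax` returns only the first row when several rows share the maximum, whereas comparing the column against the maximum value returns all tied rows.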

## Task 4: plotting histograms and boxplots (? \ 30 points)
### Preliminaries
To display figures inline in the notebook, run the following code:

In [None]:
import matplotlib.pyplot as plt
# the inline backend renders figures in the notebook, so no other backend needs to be selected
%matplotlib inline


### Question1. Create a histogram of director Income (pgs. 238 – 239). Choose the number of bins so that the plot is informative. (? \ 10 points)

Written response: Justify why you chose this number.

In [None]:
# plot histogram
df['dirIncome'].hist(bins = 15)
plt.title('Histogram of Director Income')
plt.xlabel('Director Income')
plt.ylabel('Number of movies')

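I chose 15 bins. With too few bins, the skew and the tail of the distribution are hidden; with too many, the bars mostly reflect random noise rather than the general trend. With 250 movies, 15 bins gives about 17 movies per bin on average, which keeps the bars well populated while still showing the overall shape of the data.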

Figure 1. Histogram of Director Income
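As a sanity check on a hand-picked bin count, numpy can suggest one using standard rules of thumb. The sketch below runs on simulated, hypothetical incomes (a lognormal sample of 250 values, chosen only because it is right-skewed) and compares the Sturges and Freedman-Diaconis rules via `np.histogram_bin_edges`.

```python
import numpy as np

# Simulated right-skewed data standing in for the 250 incomes (an assumption)
rng = np.random.default_rng(0)
incomes = rng.lognormal(mean=0.5, sigma=0.5, size=250)

# Number of bins suggested by two common rules of thumb
bin_counts = {}
for rule in ('sturges', 'fd'):
    edges = np.histogram_bin_edges(incomes, bins=rule)
    bin_counts[rule] = len(edges) - 1  # n edges define n - 1 bins

print(bin_counts)
```

These rules are only starting points. The same rule names can be passed as `bins` to `plt.hist`, and the resulting plot should still be inspected by eye.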

### Question2. Create a boxplot of director Income (plt.boxplot) (? \ 6 points)

In [None]:
# plot boxplot
plt.boxplot(df['dirIncome'])
plt.title('Boxplot of Director Income')
plt.ylabel('Director Income')

Figure 2. Boxplot of Director Income

### Question3. Create a violinplot of director income (plt.violinplot) (? \ 6 points )

In [None]:
# plot violin plot
plt.violinplot(df['dirIncome'])
plt.title('Violinplot of Director Income')
# the violin is drawn vertically, so the income values are on the y-axis
plt.ylabel('Director Income')

Figure 3. Violinplot of Director Income

### Question4. written response: From the three graphs, describe the distribution of director income in terms of range, skewness, and outliers. Which visualization shows which aspects best? (? \ 8 points)

The violin plot shows the range of director income, which runs from around 0.2 to around 5.9. The histogram shows that the distribution is right-skewed. The boxplot shows suspected outliers above the upper whisker.

Histograms are convenient for describing the shape of the distribution, such as its modality and skewness.

Boxplots show the median, the quartiles, the whiskers, and suspected outliers.

Violin plots show the range very well, together with the overall shape of the density.
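The points a boxplot draws beyond the whiskers follow, by default in matplotlib, the 1.5 × IQR rule. The following sketch reproduces that rule by hand on hypothetical values (not the assignment data).

```python
import numpy as np

# Hypothetical income values with one large observation
incomes = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 9.0])

# First and third quartiles, and the interquartile range
q1, q3 = np.percentile(incomes, [25, 75])
iqr = q3 - q1

# Values outside these fences are flagged as suspected outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = incomes[(incomes < lower_fence) | (incomes > upper_fence)]

print(f"fences: [{lower_fence}, {upper_fence}], suspected outliers: {outliers}")
```

A flagged point is only a suspected outlier: in a right-skewed distribution, legitimate large values often fall above the upper fence.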