# Module 1: Python Programming
## A. Welcome to Jupyter (a.k.a. IPython Notebooks)
Take a moment to get your bearings. Study the icons above.

There are two major types of cells:

1) Markdown cells - simple text. You can use HTML tags like <b>BOLD</b> or LaTeX like $\beta$.

2) Code cells - cells where we can run code.

<b>Shortcuts</b>

1) <b>CTRL-M</b> then <b>H</b> to see help

2) <b>CTRL-M</b> then <b>S</b> to save notebook

3) <b>CTRL-ENTER</b> to Run Code but stay in the same cell

4) <b>SHIFT-ENTER</b> to Run Code and advance to the next cell

5) Running <b>%pylab inline</b> before everything else in the notebook imports matplotlib and numpy. It also lets our graphics appear inside the notebook.

6) You can use <b>TAB</b> to see available functions. You can use <b>SHIFT-TAB</b> repeatedly for the documentation.

In [None]:
# This magic command is needed ONLY IF you're working in Anaconda locally. It is usually not needed in Colab.
%pylab inline

## Variables and Data Types

Python uses five standard data types:

### Numbers

In [None]:
varNum = 123
pi = 3.14159

varNum is an integer (<b>int</b>), so it holds whole numbers only, while pi is a floating-point number (<b>float</b>), which can hold values with decimal places.
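
A quick way to check this is the built-in `type` function. The sketch below reuses the two variables from the cell above; the extra names `total` and `whole` are only for illustration.

```python
varNum = 123
pi = 3.14159

print(type(varNum))   # <class 'int'>
print(type(pi))       # <class 'float'>

# Mixing an int and a float in one expression produces a float
total = varNum + pi
print(type(total))    # <class 'float'>

# int() drops the decimal part instead of rounding
whole = int(pi)
print(whole)          # 3
```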

### Strings

In [None]:
varString = "Hello World!"
varText = 'This is a String'
print(varString)
print("The length of varString is", len(varString))

Strings may be declared with single quotes (') or double quotes ("); triple double quotes (""") also work and let a string span several lines. The styles can be used interchangeably, though some prefer to follow one specific format.
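
As a small illustration of the quoting styles, the sketch below compares them directly. The variable names (`single`, `double`, `triple`, `quote`, `multi`) are made up for this example.

```python
single = 'This is a String'
double = "This is a String"
triple = """This is a String"""

# All three spellings create the same string value
print(single == double == triple)   # True

# Using the other quote style avoids escaping a quote inside the text
quote = "It's a String"
print(quote)

# Triple quotes may span several lines; the line break becomes part of the string
multi = """first line
second line"""
print(len(multi.splitlines()))      # 2
```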

### Lists

In [None]:
varList = ["abc", 123]
print(varList)
print(len(varList))

In [None]:
print(varList[0])
print(len(varList[0]))

You can think of Lists as similar to Java's ArrayLists: the index starts at 0, and you obtain an element by putting its index inside brackets. You may also append items to a list and remove them.
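
The operations mentioned above can be sketched as follows. The list `demoList` and the name `last` are illustrative only, so the notebook's own `varList` is left untouched.

```python
demoList = ["abc", 123]

demoList.append("HELLO")     # add an element to the end
print(demoList)              # ['abc', 123, 'HELLO']

demoList.remove(123)         # remove the first element equal to 123
print(demoList)              # ['abc', 'HELLO']

last = demoList.pop()        # remove and return the last element
print(last)                  # HELLO
print(demoList[0], len(demoList))   # abc 1
```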

### Tuples

In [None]:
varTuple = ('abc', 123, "HELLO")
print(varTuple)
print(len(varTuple))
print(varTuple[0])

It may seem like there are no differences between Tuples and Lists other than that Tuples use parentheses while Lists use brackets, but there are a few more. For one thing, Tuples are fixed (immutable) structures, so unlike Lists they cannot append or remove elements. Lists also have many more built-in methods readily available than Tuples do.
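
To see the fixed nature of Tuples in practice, here is a minimal sketch. The names `has_append` and `longer` are only for illustration.

```python
varTuple = ('abc', 123, "HELLO")

try:
    varTuple[0] = 'xyz'          # item assignment is not allowed on a tuple
except TypeError as error:
    print("TypeError:", error)

# Tuples have no append method; "changing" one means building a new tuple
has_append = hasattr(varTuple, "append")
print(has_append)                # False

longer = varTuple + ("WORLD",)   # the trailing comma makes a one-element tuple
print(longer)                    # ('abc', 123, 'HELLO', 'WORLD')
print(varTuple)                  # the original tuple is unchanged
```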

<b>HINT:</b> You can try to type <b><i>varList.</i></b> in one line as well as <b><i>varTuple.</i></b> and press <b>TAB</b> after the period (.) in order to view the functions you can call from that variable. You may also press <b>SHIFT-TAB</b> while the text cursor is on a name to view its documentation.

In [None]:
varList.append("HELLO")
print(varList)
print(len(varList))

However, Tuples use less memory than Lists holding the same elements, which can make them slightly faster to process. As a rule of thumb, use Tuples when the contents are fixed, and Lists when you need to keep changing the size and the elements.

In [None]:
print(varList)
print(varList.__sizeof__())
print(varTuple)
print(varTuple.__sizeof__())

### Dictionaries

In [None]:
var = 3
varDict = {'first':1, '2':'2nd', 3:var}
varDict

In [None]:
# LOOK HERE!
print(varDict['first'])

print(varDict['2'])

print(varDict[3])

You may also add the contents of a dictionary one key at a time:

In [None]:
varDict = {}
varDict['first'] = 1
varDict['2'] = '2nd'
varDict[3] = var
print(varDict[3])

If you have experience with JavaScript Object Notation (JSON), Python's Dictionaries are quite similar. You reference a value by putting its key inside the brackets.
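
The JSON comparison can be tried with the standard `json` module. This is a sketch only; the names `missing`, `text` and `restored` are illustrative, and the dictionary mirrors the one used above with `var` replaced by its value 3.

```python
import json

varDict = {'first': 1, '2': '2nd', 3: 3}

# Look up a value by its key; .get() returns a default instead of raising KeyError
print(varDict['first'])                  # 1
missing = varDict.get('missing', 'not found')
print(missing)                           # not found

# Convert the dictionary to JSON text and back
text = json.dumps(varDict)
print(text)                              # {"first": 1, "2": "2nd", "3": 3}
restored = json.loads(text)

# One difference: JSON keys are always strings, so the integer key 3 comes back as "3"
print(3 in restored, "3" in restored)    # False True
```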

## Arithmetic

Python provides the basic arithmetic operators that are present in most, if not all, programming languages.

### Addition

In [None]:
a = 5 + 3
a

### Subtraction

In [None]:
a = 5 - 3
a

### Multiplication

In [None]:
a = 5 * 3
a

### Exponent

In [None]:
a = 5 ** 3
a

### Division

In [None]:
a = 5 / 3
a

### Modulus Division

In [None]:
a = 5 % 3
a

### Integer Division

In [None]:
a = 5 // 3
a

### String Concatenation

In [None]:
a = 'Hello ' + 'World!'
a

Strings may also be concatenated with the plus <b>(+)</b> symbol.

### Complex Expressions

In [None]:
a = 3 + 5 - 6 * 2 / 4
a
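
The expression above follows the usual precedence rules: multiplication and division are done before addition and subtraction. The step-by-step breakdown below is a sketch; the names `step1` to `step3`, `result` and `b` are only for illustration.

```python
# * and / bind tighter than + and -, and are evaluated left to right
step1 = 6 * 2            # 12
step2 = step1 / 4        # 3.0 (true division always returns a float)
step3 = 3 + 5            # 8
result = step3 - step2   # 5.0

a = 3 + 5 - 6 * 2 / 4
print(a, result)         # 5.0 5.0

# Parentheses change the order of evaluation
b = (3 + 5 - 6) * 2 / 4
print(b)                 # 1.0
```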

## Challenge! Write the following as code

$$ g(z) = \frac{1}{1+e^{-z}}  $$

1) z = 8, and e = 2.718 should be equal to 0.9996643716832646

2) z = -2, and e = 2.718 should be equal to 0.1192246961081721

In [None]:
z = 8
1 / (1 + 2.718**-z)

In [None]:
z = -2
1 / (1 + 2.718**-z)

<b><i>TRIVIA</i></b>: The value <b>e</b>, also called <b>Euler's number</b>, is a mathematical constant representing an irrational number that is approximately <b>2.71828</b>. Irrational means that <b>e</b> is a real number whose decimal expansion never ends or repeats and that cannot be written exactly as a fraction of two integers, just like <b>pi</b>.
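
As a possible follow-up to the challenge, the formula can be wrapped in a reusable function that uses the more precise constant `math.e` from the standard library. This is a sketch, not the expected answer; the optional parameter `e` is an assumption added here so the rounded value 2.718 from the exercise can still be passed in.

```python
import math

def g(z, e=math.e):
    """Return 1 / (1 + e**(-z)); `e` can be overridden to match the exercise."""
    return 1 / (1 + e ** -z)

print(g(0))             # 0.5
print(g(8, e=2.718))    # first case of the challenge
print(g(-2, e=2.718))   # second case of the challenge
```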

## C. Control Statements and Data Structures
## Conditional Statements
In Python, curly brackets are not used to mark the commands that belong to a conditional statement; uniform indentation is used instead. Take note that the indentation must be made of the same characters throughout: even though 4 spaces may look the same as a tab, mixing the two in one block produces an error.
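
The effect of mixing spaces and tabs can be demonstrated without breaking the notebook by handing the source text to the built-in `compile` function. This is only a sketch; the strings `consistent` and `mixed` and the name `error_name` are made up for the example.

```python
# The same block written twice as source text; \t is a tab character
consistent = "if True:\n    x = 1\n    y = 2\n"
mixed = "if True:\n    x = 1\n\ty = 2\n"

compile(consistent, "<demo>", "exec")    # compiles without complaint

try:
    compile(mixed, "<demo>", "exec")
    error_name = None
except IndentationError as error:        # TabError is a subclass of IndentationError
    error_name = type(error).__name__

print(error_name)
```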

### String Condition

In [None]:
x = "Hello World!"

if x == 'Hello World!':
    print("var x is Hello World!")
else:
    print("var x is not Hello World!")

### Numerical Condition

In [None]:
x = 10

if x == '10':
    print("var x is a String")
elif x == 10:
    print("var x is an Integer")
else:
    print("var x is none of the above")

### Multiple Conditions

In [None]:
x = 10

if x > 5 and x < 15 and x == 10:
    print("var x is really 10!")
else:
    print("var x is not really 10")

In [None]:
x = 10

if x == 10 or x == 20:
    print("var x can be 10 or 20")
else:
    print("var x is neither 10 nor 20")

## Loops
As with Conditional Statements, the commands inside a loop are designated by uniform indentation.
### For Loops

In [None]:
for var in range(12, 24, 3):
    print(var)

<b>NOTE:</b> The command <b>range(0,5,2)</b> produces the numbers starting at 0, incremented by 2, that are less than 5: 0, 2 and 4. The stop value itself is never included.

<b>NOTE:</b> range([start], stop, [step])
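
Wrapping `range` in `list` shows the numbers it generates. A few illustrative calls (the name `countdown` is just for this example):

```python
print(list(range(5)))            # [0, 1, 2, 3, 4]  (start defaults to 0, step to 1)
print(list(range(0, 5, 2)))      # [0, 2, 4]
print(list(range(12, 24, 3)))    # [12, 15, 18, 21]

# A negative step counts downwards; the stop value is still excluded
countdown = list(range(5, 0, -1))
print(countdown)                 # [5, 4, 3, 2, 1]
```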

# Challenge
Fizzbuzz on lists - Given number $n$, populate a list $L$ which contains numbers 1 to $n$.

Additionally, for each element $i \in L$, replace $i$ with "Fizz" if $i$ is divisible by 3.

If $i$ is divisible by 5, replace it with "Buzz".
For multiples of 3 and 5, replace it with "FizzBuzz".

In [None]:
myList = []
for x in range(1, 101):
    if x % 15 == 0:
        myList.append("FizzBuzz")
    elif x % 3 == 0:
        myList.append("Fizz")
    elif x % 5 == 0:
        myList.append("Buzz")
    else:
        myList.append(x)
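
Since the challenge is stated for a general $n$, one possible variation is to wrap the same logic in a function. This is a sketch; the function name `fizzbuzz` and its parameter are assumptions, not part of the original exercise.

```python
def fizzbuzz(n):
    """Return the FizzBuzz list for the numbers 1 to n (inclusive)."""
    result = []
    for i in range(1, n + 1):
        if i % 15 == 0:          # divisible by both 3 and 5, so check this first
            result.append("FizzBuzz")
        elif i % 3 == 0:
            result.append("Fizz")
        elif i % 5 == 0:
            result.append("Buzz")
        else:
            result.append(i)
    return result

print(fizzbuzz(15))
```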

## List Generators and Comprehension
![Comprehension Syntax](https://python-3-patterns-idioms-test.readthedocs.io/en/latest/_images/listComprehensions.gif)

In [None]:
[v for v in range(1, 100, 5)]

In [None]:
my_list = range(10)
[v**2 for v in my_list if v % 2 == 0]

In [None]:
my_list = ["banana", "apple", "sykes", "avaya", "coffee"]
{index:v for index, v in enumerate(my_list) if "e" in v}
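
It may help to see that a comprehension is shorthand for an ordinary loop. The sketch below rewrites the squares example both ways; the names `squares` and `squares_loop` are illustrative.

```python
my_list = range(10)

# Comprehension form: expression, loop, then an optional filter
squares = [v**2 for v in my_list if v % 2 == 0]

# Equivalent loop form
squares_loop = []
for v in my_list:
    if v % 2 == 0:
        squares_loop.append(v**2)

print(squares)                    # [0, 4, 16, 36, 64]
print(squares == squares_loop)    # True
```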

## Slicing

In [None]:
varList = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
varList[:5]

In [None]:
varList[5:]

In [None]:
varList[:-2]

In [None]:
varList[-2:]

In [None]:
print(varList[2:-2])

In [None]:
varList[2:8:2]

<b>NOTE:</b> <i>list[[start] : [end] : [step]]</i>
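
A few more slices, shown with their results, may make the start, end and step parts clearer. The name `copy_of_list` is only for illustration.

```python
varList = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

print(varList[2:8:2])    # [3, 5, 7]  (indexes 2, 4 and 6; the end index 8 is excluded)
print(varList[::3])      # [1, 4, 7, 10]
print(varList[::-1])     # [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]  (negative step reverses)
print(varList[8:100])    # [9, 10]  (an end beyond the list is not an error)

# Omitting every part gives a shallow copy of the whole list
copy_of_list = varList[:]
print(copy_of_list == varList, copy_of_list is varList)   # True False
```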

## D. Functions

Functions use the following notation:

def <i>function_name</i>(<i>parameters</i>):<br>
<pre><i> input commands here </i></pre>

Here is a sample function that computes the remainder of n divided by m by repeated subtraction. As an aside, np.random.randint(a, b) outputs uniformly random integers from $[a,b)$.

In [None]:
def remainder(n, m):
    while True:
        if n - m < 0:
            return n
        else:
            n = n - m

In [None]:
remainder(10, 4)
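
One way to check the sample function is to compare it with Python's built-in `%` operator. The sketch below repeats the function so it runs on its own; the test pairs are arbitrary and assume a non-negative `n` and a positive `m` (with `m = 0` the loop would never end).

```python
def remainder(n, m):
    while True:
        if n - m < 0:
            return n
        else:
            n = n - m

# Arbitrary (n, m) pairs with n >= 0 and m > 0
pairs = [(10, 4), (9, 3), (2, 5), (0, 7)]
matches = all(remainder(n, m) == n % m for n, m in pairs)
print(matches)   # True
```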

# Enter Pandas

In [None]:
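# numpy (np) and matplotlib (plt) are used further below; import them explicitly in case %pylab was not run
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd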

# in kaggle, data is stored in ../input
data = pd.read_csv("../input/imdb-5000-movie-dataset/movie_metadata.csv")

In [None]:
data.shape

In [None]:
data.columns

## Slicing data frames

In [None]:
data[:4]

## Indexing Columns

In [None]:
data.director_name[:4]

In [None]:
data["director_name"][:4]

In [None]:
cols = ["movie_title","director_name"]
data[cols][:5]

In [None]:
data.loc[10:11]

In [None]:
data.loc[10:11, "actor_2_name"]

## Indexing Rows

In [None]:
data.loc[10:12]
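
One detail worth noticing in the cells above: plain slicing such as `data[:4]` excludes the end position, while `.loc` slices by label and includes the end label. The sketch below shows this on a tiny made-up frame (`toy` and its values are hypothetical, not taken from the movie dataset).

```python
import pandas as pd

# Hypothetical toy data with the default integer index 0..3
toy = pd.DataFrame({"title": ["A", "B", "C", "D"], "gross": [10, 40, 30, 20]})

print(len(toy[1:3]))                    # 2 rows: positions 1 and 2 (end excluded)
print(len(toy.loc[1:3]))                # 3 rows: labels 1, 2 and 3 (end included)
print(toy.loc[1:2, "title"].tolist())   # ['B', 'C']
```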

# Counting empty cells per column

In [None]:
data.isnull().sum()

## Find Movies by James Cameron

In [None]:
data[data["director_name"] == 'James Cameron']

# Find movies directed by anyone whose name contains "Sam"

In [None]:
# na=False treats rows with a missing director name as "no match"
data[data["director_name"].str.contains("Sam", na=False)]

## Sort films by gross earnings

In [None]:
sorted_data = data.sort_values(by="gross", ascending=False)
sorted_data[:5]

## Get the top 5 films of Michael Bay

In [None]:
data[data["director_name"] == "Michael Bay"].sort_values(by="gross", ascending=False)[:5]

In [None]:
data[data["director_name"] == "Michael Bay"].sort_values(by="gross", ascending=False).head(5)["movie_title"]

## Multiple Conditions: Find films from Canada that have Hugh Jackman as the actor_1_name

In [None]:
data[(data['country'] == 'Canada') & (data['actor_1_name'] == 'Hugh Jackman')]

# Derived Data

In [None]:
data["duration_hours"] = data["duration"] / 60

# Output to Excel / CSV

In [None]:
data.to_csv("my_data.csv", index=False)
# data.to_excel("my_data.xlsx")

# Homework - Answer Key

In [None]:
# 1) Compute sales (gross - budget) and add it as another column.
data["sales"] = data["gross"] - data["budget"]

In [None]:
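# 2) Which directors garnered the most total sales? Actors?
# Total the sales per director, then sort from highest to lowest (swap in "actor_1_name" for actors)
data.groupby("director_name")["sales"].sum().sort_values(ascending=False)[:5]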

In [None]:
# 3) Using the query function, what movies had more than 10M in sales AND more than 3 hours, is not from America and has 0 facebook likes? (These are the movies that are truly rare!)
data.query("sales > 1e7 & duration > 180 & country != 'USA' & movie_facebook_likes == 0").loc[:, "movie_title"]

In [None]:
# 4) Identify and store in a list all movies that have duplicates. Use the duplicated() function.
duplicate_titles = data.loc[data.duplicated(subset="movie_title"), "movie_title"].unique().tolist()
duplicate_titles

In [None]:
# 5) "The" is so cliche!
#     a) Get all movies that begin with "The". Use the .str.startswith function.

#     b) Get the count for each director that fulfills (a). Use the value_counts() or groupby() function.

#     c) Sort the results.

# value_counts() already returns the counts sorted from highest to lowest
data[data["movie_title"].str.startswith("The")]["director_name"].value_counts()

In [None]:
# 6) Sample 1000 rows. Use the .sample function.
data.sample(1000)

In [None]:
# 7) Output a new column named "decade". Use the cut() function to create the following decades: [pre-50s, 50s, 60s, 70s, 80s, 90s, 2000s, 2010s]
bins = [-np.inf, 1950, 1960, 1970, 1980, 1990, 2000, 2010, np.inf]
labels = ["pre-50s", "50s", "60s", "70s", "80s", "90s", "2000s", "2010s"]
# right=False makes each bin include its starting year, e.g. [1950, 1960) is the 50s
data["decade"] = pd.cut(data["title_year"], bins=bins, labels=labels, right=False)
data["decade"]

In [None]:
data["decade"].value_counts()

# Lambda Functions
Lambda functions are anonymous (nameless) functions written as a single expression. They can simplify a lot of workflows.

"Pythonic" is a term that encapsulates "elegance in simplicity". Lambda functions are Pythonic.
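
Before applying lambdas to data frames, it may help to see one next to a regular function. This is a plain-Python sketch; `square`, `square_lambda`, `words` and `by_length` are illustrative names.

```python
# A regular function and the equivalent lambda
def square(x):
    return x ** 2

square_lambda = lambda x: x ** 2
print(square(4) == square_lambda(4))   # True

# Typical use: pass a small throwaway function as an argument
words = ["banana", "apple", "sykes", "avaya", "coffee"]
by_length = sorted(words, key=lambda w: len(w))
print(by_length)   # ['apple', 'sykes', 'avaya', 'banana', 'coffee']
```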

In [None]:
# applying to a Series
data["actor_1_facebook_likes"].apply(lambda x : np.sqrt(x))[:5]

In [None]:
# shortcut
data["actor_1_facebook_likes"].apply(np.sqrt)[:5]

In [None]:
# applying to a dataframe, aggregate across rows, compile across columns
# count number of unique cells per column
data.apply(pd.Series.nunique, axis=0)

In [None]:
# applying to a dataframe, aggregate across columns, compile across rows
# count the number of empty cells per row
data.apply(lambda s: s.isnull().sum(), axis=1)

# Fill Empty Cells

In [None]:
data["actor_1_facebook_likes"] = data["actor_1_facebook_likes"].fillna(0)
data["actor_1_facebook_likes"]

# Data Type Conversion

In [None]:
data["actor_1_facebook_likes"] = data["actor_1_facebook_likes"].astype(np.int64)
data["actor_1_facebook_likes"]

# Cell Replace
![image.png](attachment:image.png)

In [None]:
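test = data.copy()
# chained indexing: the assignment lands on a temporary copy, so `test` stays unchanged (pandas also emits a warning)
test[test['gross'] > 2000000]['budget'] = 0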

In [None]:
test[test['gross'] > 2000000]["budget"][:2]

In [None]:
# .loc selects the rows and the column in one step, so the assignment changes `test` itself
test.loc[test["gross"] > 2000000, "budget"] = 0
test[test['gross'] > 2000000]["budget"][:2]

In [None]:
# the original `data` is untouched because `test` is a separate copy
data[data['gross'] > 2000000]["budget"][:2]
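
The `.loc` pattern from the cells above can be tried on a tiny frame where the result is easy to verify by eye. The frame `toy` and its numbers are hypothetical.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "gross":  [1_000_000, 3_000_000, 5_000_000],
    "budget": [500_000, 2_000_000, 4_000_000],
})

# Boolean mask: True for the rows we want to change
mask = toy["gross"] > 2_000_000

# Row selector and column name go together in a single .loc call
toy.loc[mask, "budget"] = 0
print(toy["budget"].tolist())   # [500000, 0, 0]
```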

# Summary Statistics

In [None]:
data["duration"].describe()

In [None]:
data.describe()

In [None]:
data[["director_name", "language"]].describe()

# Remove Outliers with a Histogram Plot
1. One way is through the 99th percentile
2. Another way is through Tukey's fences

![Fences](https://i.stack.imgur.com/ty5wN.png)

## Formula
![Another fences](https://lh3.googleusercontent.com/proxy/3eumgE2HN88LWe-803UoKWa6Zqy1o-qliKlz_yRC8YddBZ8iD-IGhceNNNOMawP9ppjzMft7ixaBBw8RiKwVHkw4cEjTCrPb-q87)

In [None]:
# plot a histogram of the movie durations
data["duration"].hist()

In [None]:
# seaborn is another visualization library that produces great plots
# https://seaborn.pydata.org/examples/index.html
import seaborn as sns
# histplot replaces distplot, which is deprecated in recent seaborn releases
sns.histplot(data["duration"], kde=True)

In [None]:
# one way is to clip everything above the 95th or 99th percentile
data["duration"].quantile(0.99)

In [None]:
data["duration"].clip(upper=data["duration"].quantile(0.99)).describe()

In [None]:
sns.histplot(data["duration"].clip(upper=data["duration"].quantile(0.99)), kde=True)
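
The cells above cover the percentile approach. For the second approach, Tukey's fences, a minimal sketch follows, using the common multiplier of 1.5 times the interquartile range. The series `durations` is made-up sample data, and the variable names are illustrative.

```python
import pandas as pd

# Hypothetical durations in minutes, with one obvious outlier
durations = pd.Series([90, 95, 100, 105, 110, 115, 120, 300])

q1 = durations.quantile(0.25)
q3 = durations.quantile(0.75)
iqr = q3 - q1                      # interquartile range

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
print(lower_fence, upper_fence)    # 72.5 142.5

# Values outside the fences are treated as outliers
outliers = durations[(durations < lower_fence) | (durations > upper_fence)]
print(outliers.tolist())           # [300]

# Either drop them or clip them to the fences
clipped = durations.clip(lower=lower_fence, upper=upper_fence)
print(clipped.max())               # 142.5
```

On the movie data, the same steps would be applied to `data["duration"]` in place of `durations`.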

# Correlation

In [None]:
cols = ["duration", "actor_1_facebook_likes", "gross", "budget", "director_facebook_likes", "movie_facebook_likes", "num_user_for_reviews", "sales"]
data[cols].corr()

## Conditional Formatting
https://pandas.pydata.org/pandas-docs/stable/user_guide/style.html

In [None]:
cm = sns.light_palette("green", as_cmap=True)

data_corr = data[cols].corr().round(2)
data_corr.style.background_gradient(cmap=cm).format('{:.0%}')

# Group By Operations
https://towardsdatascience.com/its-time-for-you-to-understand-pandas-group-by-function-cc12f7decfb9
## Concept
![Concept](https://jakevdp.github.io/figures/split-apply-combine.svg)


## Syntax
![Formula](https://miro.medium.com/max/1908/1*RpuOQTpRPwJhtqKikfG6hA.png)

In [None]:
data.groupby("title_year").size()
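
The split-apply-combine idea is easier to follow on a frame small enough to check by hand. The sketch below uses made-up directors and sales figures; `toy`, `totals` and `counts` are illustrative names.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "director_name": ["Ann", "Ann", "Bob", "Cy", "Bob"],
    "sales": [10, 20, 5, 7, 15],
})

# Split by director, apply sum() to each group's sales, combine into one Series
totals = toy.groupby("director_name")["sales"].sum()
print(totals)

# size() counts the rows in each group
counts = toy.groupby("director_name").size()
print(counts)
```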

In [None]:
plt.figure(figsize=(10, 8))
data.groupby("title_year").size().plot()
plt.title("Number of Movies per Year")

In [None]:
# total sales per director
data.groupby("director_name")["sales"].sum()

In [None]:
# get total sales per director and get top 10
top_10_directors = data.groupby("director_name")["sales"].sum().sort_values(ascending=False)[:10]
# reverse so we have a better graph
# top_10_directors = top_10_directors[::-1]
top_10_directors.plot.barh()

# Let's end with a super automated way of data profiling
BTW, this is a newer library. Automated data analysis is all the rage these days.

In [None]:
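# importing pandas_profiling adds the profile_report() method to data frames
# (newer releases of the library are distributed under the name ydata-profiling)
import pandas_profiling as pp
profile_report = data.profile_report()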

In [None]:
profile_report

# For homework, check these other notebooks

- Exploratory Data Analysis

https://www.kaggle.com/abhi98/eda-on-imdb-dataset

- EDA with some ML

https://www.kaggle.com/vikasbhadoria/imdb-5000-movie-data-analyses-and-score-prediction
