# Movie Data Analysis

Let's use some movie data to learn how to analyze data in Python with the pandas library.

We will be using data from The Movie Database (TMDB) to answer some basic questions.

In [1]:
# First things first let's import some libraries:
import pandas as pd
import seaborn as sns

%matplotlib inline

# Python Basics

In python, we create a variable just by using the `=` operator to assign a value to a new variable name.

All we need to do is something like this:
```Python
a = 3
```

One of the beautiful things about python is that we don't need to _explicitly_ declare what type of variable we are creating...we just assign it a value and python works out the type from that value. In our example above, we gave the variable `a` the value `3`. Python sees the value is an integer, so `a` is of type `int`.

We have many types of data available to us in python: integers, floats, bool, strings, dataframes, etc...
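
If we ever want to check which type python picked, we can ask with the built-in `type()` function. Here is a minimal sketch _(the variable names and values are made up for illustration)_:

```Python
# Hypothetical example variables, one per basic type
a = 3
price = 4.99
title = "Jaws"
is_sequel = False

# type() reports the type python worked out from each value
print(type(a))          # <class 'int'>
print(type(price))      # <class 'float'>
print(type(title))      # <class 'str'>
print(type(is_sequel))  # <class 'bool'>
```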

Let's make some new variables and assign different types of values to them. If the variable name you want to use has a couple of words, just replace the spaces with underscores _(this is called snake case)_ like so:

```Python
my_awesome_new_variable = "Howdy! I'm a string!"
```

---
## Practice Section 1

In [2]:
# Make a new variable called 'b' and give it an integer value:
b = 7

In [3]:
# we can get python to print the value back out by using the 'print' function like this:
print(b)

7


In [None]:
# Make a variable named 'x' and give it a decimal value (also known as a 'float'):


In [None]:
# Now make a variable named 'my_string' and give it a string:


In [None]:
# We can also assign values that are Boolean (True or False), make one called 'my_bool':


In [None]:
# Now print out the values for all your new variables:


---

# More Complex Data Structures

Beyond just simple data types, python can also store more complex types. We can collect sets of values into structures like `lists`, `tuples`, and `dictionaries`.

We tell python to assign a `list` using square brackets like so:
```Python
my_list = [1, 2, 4, 8, "yup here's a string", 2.7, True]
```

Tuples are just like lists, except that once values are assigned, we cannot change them later _(a property called immutability)_. Also, we tell python we want a tuple by using parentheses:
```Python
some_tuple = (2.542, 786.5, 54)
```

Dictionaries are used to store collections of key / value pairs, meaning that we can look up a key in our dictionary to get its value. We use curly braces to assign a dictionary:
```Python
new_dict = {"key_1": 2, "another_key": "Yo!", "more_keys": 3.786}
```

We can get single values back from our dictionary by calling a key with square brackets:
```Python
new_dict['key_1']
```
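
To make the differences between these structures concrete, here is a small sketch _(the values are invented for illustration)_. It shows that a list can be changed after it is created, that a tuple cannot, and that a dictionary's `.get()` method lets us supply a fallback for a key that doesn't exist. It uses a `try`/`except` block to catch the error python raises, which we won't cover in detail here.

```Python
my_list = [1, 2, 4]
my_list[0] = 99            # lists are mutable, so this works
print(my_list)             # [99, 2, 4]

some_tuple = (2.542, 786.5, 54)
try:
    some_tuple[0] = 1.0    # tuples are immutable, so this raises a TypeError
except TypeError as error:
    print("Can't change a tuple:", error)

new_dict = {"key_1": 2, "another_key": "Yo!"}
print(new_dict["key_1"])                   # 2
print(new_dict.get("missing_key", "n/a"))  # n/a (instead of a KeyError)
```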

Let's get those fingers typing again! Make some lists, tuples, and dictionaries...after you make your dictionary, print one of the values by calling the key in square brackets.

---
## Practice Section 2

In [None]:
# First up, make a list (remember to use square brackets!):


In [None]:
# Now make a tuple (use parentheses!):


In [None]:
# Lastly make a dictionary (use curly braces, separate "key": value pairs with a comma):


In [None]:
# Now that you have a dictionary, print out the value associated with one of your keys.
#    (remember to use your variable name followed by square brackets around your key name!)


---
# Getting Specific Values Back From Lists and Tuples

We can use the index value to return that specific value back from a list or a tuple. We just have to specify the index number in square brackets...**NOTE:** python starts counting at zero!

Let's see this in action:

In [4]:
example_list = [1, 2, 3.2, "fourth element in the list", "last element"]

# We can get the first element by using its index (remember 0 is the first index!):
print(example_list[0])

# We can get the fourth element by using the index 3:
print(example_list[3])

# We can also get a range of values back by using a colon:
print(example_list[0:3])

1
fourth element in the list
[1, 2, 3.2]


Notice how the range omitted the last index _(it did not print index 3 - the fourth element)_.

There are a few shortcuts that are helpful to know. For example, if we want the last element in a list but don't want to count out exactly how many elements are in it, we can just use the index `-1` instead:

In [5]:
example_list[-1]

'last element'
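
A couple of other shortcuts work the same way. This sketch reuses `example_list` from above; the other variable names are just for illustration:

```Python
example_list = [1, 2, 3.2, "fourth element in the list", "last element"]

first_two = example_list[:2]    # leaving out the start means "from the beginning"
last_two = example_list[-2:]    # leaving out the end means "through the last element"
how_many = len(example_list)    # len() counts the elements for us

print(first_two)   # [1, 2]
print(last_two)    # ['fourth element in the list', 'last element']
print(how_many)    # 5
```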

---
# Flow Control

Our programs will likely need to behave differently based on some conditions. We can use things like `if`/`elif`/`else` blocks or `for` loops to modify what our script does based on those conditions.

For example, maybe we want to see if our customers are old enough to place an order for beer:

```Python
customer_age = 37

if customer_age >= 21:
    print("Yup, it's Miller time!")

else:
    print("Nice try youngbuck! No beer for you...")
```

Take a look at the script above...what do you think should be printed?

Some things to take notice of:
* notice that we have the keyword `if` followed by a condition, this condition **has** to evaluate to either True or False
* After our condition, we need to put a colon
* if the condition is True, python will execute all of the code in the block under the `if` statement that is indented 4 spaces
* Python uses indentation rather than curly braces to define where code blocks start and end
* The `else` statement is optional

It's possible that we need to check for multiple conditions. For example, we may not want to sell beer to customers that are too old. We can stack these conditions together like this:

```Python
customer_age = 37

if customer_age >= 21 and customer_age < 100:
    print("Go ahead, grab a cold one!")

elif customer_age >= 100:
    print("Whoa there oldie...you may want to check with your doctor first.")
          
else:
    print("Nope!")
```
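
Python also lets us chain comparisons, so the first condition can be written the way we would say it out loud. Here is a small sketch of the same check _(the variable `message` is only added here so we can print once at the end)_:

```Python
customer_age = 37

# 21 <= customer_age < 100 means the same thing as
# customer_age >= 21 and customer_age < 100
if 21 <= customer_age < 100:
    message = "Go ahead, grab a cold one!"
elif customer_age >= 100:
    message = "Whoa there oldie...you may want to check with your doctor first."
else:
    message = "Nope!"

print(message)
```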

---
## Practice Section 3

In [None]:
# Create two integer variables, then write an if/elif/else block that prints whether the 
#    two variables are equal, the first is less than the second, or the first is greater
#    than the second:


---
# For Loops

For loops are perfect for when you want to repeat a block of code multiple times. For example, we might want to check whether each one of our customers is able to order beer. We can have our block of code loop through each customer's age one-by-one using a list of ages and a `for` loop.

Let's see this in action:

In [6]:
customer_ages = [19, 3, 105, 37, 22]

for age in customer_ages:
    if age >= 21 and age < 100:
        print("Go ahead, grab a cold one!")

    elif age >= 100:
        print("Whoa there oldie...you may want to check with your doctor first.")

    else:
        print("Nope!")

Nope!
Nope!
Whoa there oldie...you may want to check with your doctor first.
Go ahead, grab a cold one!
Go ahead, grab a cold one!


Some things to notice:
* Just like `if`/`else` we start with a special keyword, in this case `for`
* We define a temporary variable that holds the values we are iterating over _(here we called it `age` but we can call it whatever we want)_
* Just as before we use indentation to tell python what we want to execute for each iteration
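
Two built-in helpers come up a lot with `for` loops: `enumerate()` and `range()`. The sketch below is just an illustration of both _(the names `position`, `number`, and `total` are arbitrary choices)_:

```Python
customer_ages = [19, 3, 105, 37, 22]

# enumerate() hands us the position and the value on each pass through the loop
for position, age in enumerate(customer_ages):
    print("Customer", position, "is", age, "years old")

# range(3) gives us the numbers 0, 1, 2 without needing to write out a list
total = 0
for number in range(3):
    total = total + number

print(total)   # 3
```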

Watch out for infinite loops! If we accidentally write a loop that continues indefinitely, our program will never stop running!! This is more of an issue for `while` loops, but in case you find yourself with a wild loop running out of control, just interrupt the kernel by pressing the square stop button next to the `Run` button in the control bar _(or by hitting `Ctrl` and `c` if you are running your script from the command line/terminal)._
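
For reference, here is a minimal sketch of a `while` loop that does stop _(the variable `beers_left` is made up for this example)_. The key is that something inside the loop eventually makes the condition False:

```Python
beers_left = 3

# The loop keeps going as long as the condition is True
while beers_left > 0:
    print("Beers left:", beers_left)
    beers_left = beers_left - 1   # without this line the loop would never end!

print("All out!")
```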

---
## Practice Section 4

In [None]:
# Create a list of strings, then write a for loop to print the number of characters in each 
#    string in your list

# HINT: Use the len() function to get the length of a string


---
# Functions

We often need to write our own functions that do some specific task that isn't already built into python. These are super handy when we need to run that block of code over and over again.

Imagine we have a program that needs us to convert from centimeters to inches. This program needs to do this conversion a dozen times. If we just write out the formula each time we need it, we have a dozen identical blocks of code, which is pretty annoying and redundant. Also, what if our manager changes the requirements, and decides they want it converted to feet instead of inches? We now have to go through and find every place we did the conversion and update it! Even worse, maybe we missed one after the update...uh-oh, troubles are on the horizon!

This is why it is good practice to keep your programs DRY.

DRY = "Don't Repeat Yourself"

When you write a function to do this conversion, it only lives in one place and can be easily updated no matter how many times our program has to call it!

Let's see what a function definition looks like:

```Python
def convert_cm_len(input_cm):
    """
    Input a numeric value for centimeters and return the equivalent length in inches
    """
    inches = input_cm / 2.54
    return inches
```

Here's what you need:
* Start with the keyword `def`
* Next we give our function a unique name _(make it snake case)_
* Follow the name with a set of parentheses
* If your function needs some input from the user _(we call these parameters)_ name them inside the parentheses...if you need more than one, separate them with commas
* End your definition line with a colon
* Make sure you indent the whole block of code you want your function to run!
* Also, the `return` statement is what you want your function to give back when it's finished running _(this is actually optional - your function doesn't always have to return something)_

The example above also has a documentation string on the first line of the function block. This is also optional, but it makes your programs so much easier to read for other programmers _(or for ourselves when we look back over something we wrote a while ago)_.

Let's pull some of this together and create a list of lengths in centimeters, and then use a `for` loop to convert all of those lengths to inches using our function above:

In [8]:
def convert_cm_len(input_cm):
    """
    Input a numeric value for centimeters and return the equivalent length in inches
    """
    inches = input_cm / 2.54
    return inches

cm_lengths = [2.4, 14.57, 250, 0.897]

for cm in cm_lengths:
    print(convert_cm_len(cm))

0.9448818897637795
5.736220472440945
98.4251968503937
0.3531496062992126
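
Functions can also take more than one parameter, and a parameter can be given a default value so the caller may leave it out. The function below is a hypothetical helper written only for illustration _(`format_inches` and its `decimals` parameter are not part of the original notebook)_:

```Python
def format_inches(input_cm, decimals=2):
    """
    Convert centimeters to inches and return a rounded, labeled string.
    (Hypothetical helper for illustration only.)
    """
    inches = input_cm / 2.54
    return str(round(inches, decimals)) + " in"

print(format_inches(250))               # uses the default of 2 decimals
print(format_inches(250, decimals=0))   # overrides the default
```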


---
## Practice Section 5

In [None]:
# Write a function that takes in lengths in inches and converts them to feet
#    hint: there are 12 inches in a foot


# Now create a list of lengths of inches:


# Use a for loop to convert each of these lengths to feet using your new function above:



# BONUS: write another function that uses our earlier function `convert_cm_len` to take in 
#    centimeters and converts them to feet:


---
# Libraries

One of python's strengths is data manipulation and analysis. Often this analysis is done on tabular data _(like spreadsheets)_.

Because python is open source, there are tons of libraries available to us _(for free!)_ that allow us to extend what we can do. All we need to do is install and load the packages we want.

If you installed Anaconda on your machine, then you've already got a whole bunch of libraries installed that help us with common data manipulation tasks, like `pandas` and `numpy`. Beyond that, it also installed lots of common packages for data science _(like `sklearn` and `scipy`)_. All we need to do is import them here in our program.

**If you didn't install Anaconda**, you may need to go install these packages for Python3 from your command line or terminal.

From a mac, open up `Terminal` and type the following:
```pip3 install pandas```

From a PC, open up `Command Prompt` and type the following:
```python -m pip install pandas```

These commands tell your computer to connect to the Python Package Index _(PyPI, the central python package repository)_ and download the latest version of the pandas package.

Now, in our program we just need to import it, which we can do like so:
```Python
import pandas as pd
```

Notice that because programmers are usually lazy, most of these common libraries get imported with an alias that is simpler and shorter _(and easy to type)_ which helps keep our python code shorter and easier to read.

Scroll all the way back up to the first cell in this notebook...see all those import statements? That is thousands of hours of development time that we loaded into our program with a few lines of typing _(for free!)_.

We can now start using the pandas library to work with tabular data in our python program!
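
The same `import ... as ...` pattern works for any library. As a small sketch, here it is with `statistics`, a module that ships with python itself so there is nothing to install _(the alias `stats` and the ratings are made-up choices for this example)_:

```Python
# statistics comes with python, so no install step is needed
import statistics as stats

# Made-up movie ratings for illustration
ratings = [7.5, 8.1, 6.4, 9.0]

# We call functions from the library through the short alias
average_rating = stats.mean(ratings)
print(average_rating)
```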

---
# Pandas

The main data structure for pandas is tabular data, which is called a DataFrame. We can enter this data in ourselves manually. More likely, we will be getting this data from other sources like a SQL query to our database or from an external file like a `.csv` file.

We can create a new, empty dataframe like this:
```Python
my_df = pd.DataFrame()
```

We can then add series of values into columns of our dataframe like this:
```Python
my_df['First Column'] = [1.6, 2.0, 3.77, 0.917]
my_df['Second Column'] = ['string 1', 'string 2', 'string 3', 'string 4']
my_df['Mixed Column'] = [1, 54.78, 'yuck, a string!', True]
```

Notice that we can have different types of data in the same column.

Let's create our first dataframe and take a look at it:

In [9]:
my_df = pd.DataFrame()

my_df['First Column'] = [1.6, 2.0, 3.77, 0.917]
my_df['Second Column'] = ['string 1', 'string 2', 'string 3', 'string 4']
my_df['Mixed Column'] = [1, 54.78, 'yuck, a string!', True]

my_df

| | First Column | Second Column | Mixed Column |
|---|---|---|---|
| 0 | 1.6 | string 1 | 1 |
| 1 | 2.0 | string 2 | 54.78 |
| 2 | 3.77 | string 3 | yuck, a string! |
| 3 | 0.917 | string 4 | True |
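
Another common way to build a dataframe is to hand pandas a dictionary, where each key becomes a column name. This is only a sketch with invented movie data _(the titles, column names, and numbers are placeholders, not values from TMDB)_:

```Python
import pandas as pd

# Each key becomes a column name, each list becomes that column's values
movie_df = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C"],
    "runtime_minutes": [95, 120, 143],
    "rating": [6.8, 7.4, 8.1],
})

print(movie_df.shape)          # (3, 3)
print(list(movie_df.columns))  # ['title', 'runtime_minutes', 'rating']
```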


If we want to see how many rows and columns our dataframe has, we just need to use the `shape` attribute:

In [10]:
my_df.shape

(4, 3)

If we want to know what type the data in each column is, we can use the `dtypes` attribute:

In [13]:
my_df.dtypes

First Column     float64
Second Column     object
Mixed Column      object
dtype: object

Notice that our `Mixed Column` is of type `object` because it has a mixture of differing types!

We can also get a sense of the distribution of the data in our dataframe using the `describe()` method. By default, this only summarizes columns that are strictly numeric values!

In [14]:
my_df.describe()

| | First Column |
|---|---|
| count | 4.0 |
| mean | 2.07175 |
| std | 1.217264 |
| min | 0.917 |
| 25% | 1.42925 |
| 50% | 1.8 |
| 75% | 2.4425 |
| max | 3.77 |
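
If we do want a summary of the non-numeric columns too, `describe()` accepts an `include` argument. Here is a sketch on a small made-up dataframe _(the names `small_df`, `score`, and `label` are assumptions for this example)_:

```Python
import pandas as pd

# Made-up data for illustration
small_df = pd.DataFrame({
    "score": [1.6, 2.0, 3.77, 0.917],
    "label": ["a", "b", "a", "c"],
})

# Default behavior: only the numeric column is summarized
numeric_summary = small_df.describe()
print(list(numeric_summary.columns))   # ['score']

# include="all" asks pandas to summarize every column;
# text columns get rows like count/unique/top/freq instead of mean/std
full_summary = small_df.describe(include="all")
print(list(full_summary.columns))      # ['score', 'label']
```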


Just like our lists from above, we can select specific values out of our dataframes. We can grab whole columns:
```Python
my_df['First Column']
```

Whole rows _(by their index label, using `.loc`)_:
```Python
my_df.loc[0]
```

Or only specific rows and columns:
```Python
my_df.loc[0:2, ['Second Column', 'Mixed Column']]
```

Because our tables have 2 dimensions _(rows and columns)_ we have to give the indices for both dimensions we want! This is called `slicing` our dataframe.

One thing to notice, unlike with lists, when we give a range of row labels to `.loc` it will include the label given for the end of the range _(so `0:2` will give rows 0, 1, and 2)_.
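
Row selection in pandas generally goes through an indexer: `.loc` selects by label and `.iloc` selects by position. The sketch below rebuilds a two-column version of `my_df` so it can run on its own, and compares how the two treat the end of a range _(the names `by_label` and `by_position` are just for illustration)_:

```Python
import pandas as pd

my_df = pd.DataFrame()
my_df['First Column'] = [1.6, 2.0, 3.77, 0.917]
my_df['Second Column'] = ['string 1', 'string 2', 'string 3', 'string 4']

# .loc selects by label, and a label range INCLUDES the end label
by_label = my_df.loc[0:2, ['Second Column']]
print(len(by_label))      # 3 rows: labels 0, 1, and 2

# .iloc selects by position and works like list slicing (the end is EXCLUDED)
by_position = my_df.iloc[0:2]
print(len(by_position))   # 2 rows: positions 0 and 1

# A single value: row label 1, column 'First Column'
print(my_df.loc[1, 'First Column'])   # 2.0
```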

---
## Practice Section 6

In [None]:
# Create a blank DataFrame with a few columns each with several values:


# Now check how many rows and columns your DataFrame has:


# Check what data types each of your DataFrame's columns are:



The pandas library has tons of features included in it! We are barely scratching the surface here today. There are lots of different ways to load in data and lots of ways to mask and slice that data depending on what your program needs to do.

Sometimes you will see values selected with `.at[]`, `.loc[]`, or `.iloc[]` _(note the square brackets: these are indexers, not methods)_. Don't worry about the details for the time being, but just be aware that there is a whole robust ecosystem of data analysis available to you in pandas!

----
# Ordering and Masking

A lot of times we will want to see our dataframe ordered by the values in some column. For example, we might want to look at our dataframe ranked by the values in `First Column`:

In [15]:
my_df.sort_values(by='First Column')

| | First Column | Second Column | Mixed Column |
|---|---|---|---|
| 3 | 0.917 | string 4 | True |
| 0 | 1.6 | string 1 | 1 |
| 1 | 2.0 | string 2 | 54.78 |
| 2 | 3.77 | string 3 | yuck, a string! |


Hmmm...pandas assumes we want the values ranked in ascending order. But, if we want them descending, we just have to modify one of the default argument values like this:

In [16]:
my_df.sort_values(by='First Column', ascending=False)

| | First Column | Second Column | Mixed Column |
|---|---|---|---|
| 2 | 3.77 | string 3 | yuck, a string! |
| 1 | 2.0 | string 2 | 54.78 |
| 0 | 1.6 | string 1 | 1 |
| 3 | 0.917 | string 4 | True |


Also, notice that pandas by default doesn't want to modify the underlying dataframe _(so we don't accidentally mess up our data)_ :

In [17]:
my_df

| | First Column | Second Column | Mixed Column |
|---|---|---|---|
| 0 | 1.6 | string 1 | 1 |
| 1 | 2.0 | string 2 | 54.78 |
| 2 | 3.77 | string 3 | yuck, a string! |
| 3 | 0.917 | string 4 | True |


Typically, when we want pandas to make the modifications permanent we need to set the `inplace` parameter to `True` _(or assign the result back to a variable)_:

In [18]:
my_df.sort_values(by='First Column', ascending=False, inplace=True)

In [19]:
my_df

| | First Column | Second Column | Mixed Column |
|---|---|---|---|
| 2 | 3.77 | string 3 | yuck, a string! |
| 1 | 2.0 | string 2 | 54.78 |
| 0 | 1.6 | string 1 | 1 |
| 3 | 0.917 | string 4 | True |
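
An alternative to `inplace=True` is to keep the sorted result by assigning it to a name. Notice in the output above that the row labels _(2, 1, 0, 3)_ travel with their rows. The sketch below shows both ideas on a small made-up dataframe _(`scores`, `sorted_scores`, and `renumbered` are illustrative names)_:

```Python
import pandas as pd

# Made-up data for illustration
scores = pd.DataFrame({"score": [1.6, 2.0, 3.77, 0.917]})

# sort_values returns a new, sorted dataframe; we keep it by assigning it to a name
sorted_scores = scores.sort_values(by="score", ascending=False)

# The original row labels travel with the rows...
print(list(sorted_scores.index))   # [2, 1, 0, 3]

# ...so reset_index(drop=True) renumbers them 0, 1, 2, 3 if we prefer
renumbered = sorted_scores.reset_index(drop=True)
print(list(renumbered.index))      # [0, 1, 2, 3]
```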


Masking is where we only want to get the rows back that meet some condition. For example, maybe we only want to get rows back where the values in the first column are greater than or equal to 2.

Here's how we mask a DataFrame:

In [20]:
my_df[my_df['First Column'] >= 2]

| | First Column | Second Column | Mixed Column |
|---|---|---|---|
| 2 | 3.77 | string 3 | yuck, a string! |
| 1 | 2.0 | string 2 | 54.78 |