# Python Interactive Notebook

Today we will learn and use Python, covering basic syntax and the Numpy, Pandas, and Matplotlib libraries. These libraries are widely used, and plenty of documentation can be found online. Don't be afraid to search Google/Stack Overflow!
- Numpy: https://docs.scipy.org/doc/numpy-dev/user/index.html
- Pandas: http://pandas.pydata.org/pandas-docs/stable/
- Matplotlib: https://matplotlib.org/contents.html
- Syntax: https://www.w3schools.com/python/python_syntax.asp

# Table of Contents

I. [Syntax](#1)<br>
II. [Numpy](#2)<br>
III. [Pandas](#2.5)<br>
IV. [Matplotlib](#3)<br>

### Jupyter Notebook Recap

`To run a cell: select cell, press SHIFT + ENTER`

The value of the last line of a cell is displayed automatically.

In [1]:
"this will NOT be displayed"
"this will be displayed"

'this will be displayed'

If cells contain `...` we expect you to replace `...` with your code :)

## Import

`numpy`, `pandas`, and `matplotlib` are made by other people! We need to `import` these modules in order to use them. We also import `seaborn`, another Python library. We won't plot with it in this workshop, but we will use it to load a couple of example datasets.

In [2]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt


# <font id="1" color="blue">Syntax</font>
---
So far, we've added the following *operations* to our toolbox:
- `+` , `-` , `*` , `/` : Add, subtract, multiply, divide
- `=` : Assign variables
- `<`, `>`, `<=`, `>=`, `==`: Compare values
---
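
As a quick illustration of these operators, here is a minimal sketch using made-up numbers. The variable names and values are hypothetical and are not taken from the sales table that follows.

```python
# Hypothetical values, chosen only to illustrate the operators
ticket_price = 12        # `=` assigns a value to a name
tickets_sold = 30

ticket_sales = ticket_price * tickets_sold      # multiply
venue_cost = 200
net_income = ticket_sales - venue_cost          # subtract
income_per_ticket = net_income / tickets_sold   # `/` always returns a float

# Comparisons evaluate to a boolean (True or False)
met_goal = net_income >= 150

print(ticket_sales, net_income, income_per_ticket, met_goal)
```

The same pattern (assign, combine, compare) is all that the tasks in this section need.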

Given that we have the following sales values for 2016 and 2017:

<header><h4 align='center'>Sales</h4></header>
<table border="1" class="dataframe">
    <thead>
        <tr>
            <td><b>Year</b></td>
            <td><b>Product</b></td>
            <td><b>Revenue</b></td>
            <td><b>Cost</b></td>
        </tr>
    </thead>
    <tr>
        <td>2016</td>
        <td>Phone</td>
        <td>320000</td>
        <td>254000</td>
    </tr>
    <tr>
        <td>2016</td>
        <td>Laptop</td>
        <td>120000</td>
        <td>80000</td>
    </tr>
    <tr>
        <td>2017</td>
        <td>Phone</td>
        <td>465000</td>
        <td>362000</td>
    </tr>
    <tr>
        <td>2017</td>
        <td>Laptop</td>
        <td>105000</td>
        <td>67300</td>
    </tr>
</table>

**Task:** 
- Assign four variables to represent phone revenues and costs in 2016 and 2017, respectively.
- Use these four variables to create two new variables representing profit in 2016 and 2017, respectively.
- Use the two profit variables to calculate total combined profit in 2016 and 2017.

In [23]:
# Phone revenues and costs in 2016 and 2017
revenue_2016 = ...
cost_2016 = ...
revenue_2017 = ...
cost_2017 = ...

print(revenue_2016, cost_2016, revenue_2017, cost_2017)

Ellipsis Ellipsis Ellipsis Ellipsis


In [24]:
# Profit in 2016 and 2017
profit_2016 = ...
profit_2017 = ...

print(profit_2016, profit_2017)

Ellipsis Ellipsis


In [25]:
# Combined total profit
total_profit = ...

total_profit

Ellipsis

**Task:**
  - What were average monthly profit figures over 2016 and 2017, combined?
  - Our goal was to achieve $10,000 in average monthly profit over the last year. Did we achieve this? Format the answer as a boolean (`True` or `False`)

In [26]:
# Monthly profit figures
monthly_profit = ...

monthly_profit

Ellipsis

In [27]:
# Success?
success = ...

success

Ellipsis

## Lists and loops
---
We now have additional tools in our toolbox:
- Lists allow us to store multiple values in one variable
- Loops allow us to operate on lists by *iterating* through each value in the list
---
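
To make these two ideas concrete, here is a small sketch with made-up data (the list `daily_temps` and its values are hypothetical). It shows a loop over the values directly and a loop over index positions, which is handy when you need to look at neighboring elements.

```python
# Hypothetical daily temperatures stored in one list
daily_temps = [61, 64, 58, 70]

# Loop over the values directly
running_total = 0
for temp in daily_temps:
    running_total += temp

average_temp = running_total / len(daily_temps)
print(average_temp)

# Loop over positions 0, 1, 2, 3 and index into the list
for i in range(len(daily_temps)):
    print(i, daily_temps[i])
```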

Ideally, all of our raw datasets would be cleanly organized and formatted exactly to our needs. For example, if we were looking for data regarding annual revenue, cost, and profit, our table from Section 1 contains exactly the information that we want. However, not all data is this cleanly formatted and aggregated into rows; furthermore, this may not necessarily be in our best interest! (Why not?)

Instead, we might have more granular data in the form of monthly values:

<header><h4 align='center'>2016 Phone Sales</h4></header>
<table border="1" class="dataframe">
    <thead>
        <tr>
            <td><b>Year</b></td>
            <td><b>Month</b></td>
            <td><b>Revenue</b></td>
            <td><b>Cost</b></td>
        </tr>
    </thead>
    <tr>
        <td>2016</td>
        <td>1</td>
        <td>33000</td>
        <td>26800</td>
    </tr>
    <tr>
        <td>2016</td>
        <td>2</td>
        <td>24000</td>
        <td>19200</td>
    </tr>
    <tr>
        <td>2016</td>
        <td>3</td>
        <td>19000</td>
        <td>15900</td>
    </tr>
    <tr>
        <td>2016</td>
        <td>4</td>
        <td>20000</td>
        <td>16300</td>
    </tr>
    <tr>
        <td>2016</td>
        <td>5</td>
        <td>21000</td>
        <td>15000</td>
    </tr>
    <tr>
        <td>2016</td>
        <td>6</td>
        <td>23000</td>
        <td>18000</td>
    </tr>
    <tr>
        <td>2016</td>
        <td>7</td>
        <td>21000</td>
        <td>16700</td>
    </tr>
    <tr>
        <td>2016</td>
        <td>8</td>
        <td>26000</td>
        <td>21300</td>
    </tr>
    <tr>
        <td>2016</td>
        <td>9</td>
        <td>24000</td>
        <td>21000</td>
    </tr>
    <tr>
        <td>2016</td>
        <td>10</td>
        <td>28000</td>
        <td>23000</td>
    </tr>
    <tr>
        <td>2016</td>
        <td>11</td>
        <td>43000</td>
        <td>35700</td>
    </tr>
    <tr>
        <td>2016</td>
        <td>12</td>
        <td>38000</td>
        <td>30100</td>
    </tr>
</table>

Then the revenue and cost columns can be represented using two lists:

In [28]:
monthly_revenue_2016 = [33000, 24000, 19000, 20000, 21000, 23000, 21000, 26000, 24000, 28000, 43000, 38000]
monthly_cost_2016 = [26800, 19200, 15900, 16300, 15000, 18000, 16700, 21300, 21000, 23000, 35700, 30100]

**Task:**
- For 2016 phone sales' monthly revenue and monthly cost, find each of the following:
  - Mean
  - Standard deviation ([`np.std`](https://docs.scipy.org/doc/numpy-dev/reference/generated/numpy.std.html))
  - 25, 50, 75th percentiles ([`np.percentile`](https://docs.scipy.org/doc/numpy-dev/reference/generated/numpy.percentile.html))

In [29]:
# Mean
revenue_2016_mean = ...
cost_2016_mean = ...

print(revenue_2016_mean, cost_2016_mean)

Ellipsis Ellipsis


In [30]:
# Standard deviation
revenue_2016_std = ...
cost_2016_std = ...

print(revenue_2016_std, cost_2016_std)

Ellipsis Ellipsis


In [31]:
# 25, 50, 75th percentiles
revenue_2016_perc = ...
cost_2016_perc = ...

print(revenue_2016_perc, cost_2016_perc)

Ellipsis Ellipsis


Before we complete the next task, run the following two cells and compare their outputs:

In [32]:
[1, 2, 3] + [1, 2, 3]

[1, 2, 3, 1, 2, 3]

In [33]:
np.add([1, 2, 3], [1, 2, 3])

array([2, 4, 6])

We see that the `+` operator *concatenates* the two lists, whereas [`np.add`](https://docs.scipy.org/doc/numpy-dev/reference/generated/numpy.add.html) performs *element-wise* addition. Similarly, [`np.subtract`](https://docs.scipy.org/doc/numpy-dev/reference/generated/numpy.subtract.html) will perform *element-wise* subtraction.
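
As a small sketch of element-wise subtraction, using hypothetical lists that are not taken from the sales tables:

```python
import numpy as np

# Hypothetical values, only for illustration
income = [10, 20, 30]
expenses = [1, 2, 3]

# Element-wise: [10 - 1, 20 - 2, 30 - 3]
difference = np.subtract(income, expenses)
print(difference)

# Built-in lists do not support `-` at all
try:
    income - expenses
except TypeError as err:
    print("TypeError:", err)
```

Note that `np.subtract` accepts plain lists and returns a numpy array.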

**Task:**
- Use [`np.subtract`](https://docs.scipy.org/doc/numpy-dev/reference/generated/numpy.subtract.html) to calculate the profit for each month in 2016.

In [34]:
# Monthly profit 2016
monthly_profit_2016 = ...

monthly_profit_2016

Ellipsis

**Task:**
- (Challenging!) Use a loop to calculate the month-over-month raw change (in dollars, *not* percentages) for 2016 profits.
  - Hint: Since there are 12 months (`len(monthly_profit_2016)`), there are 11 values that we will want to calculate.

In [35]:
# Month-over-month profit change
monthly_profit_change = []

for i in ...:
    ...
    monthly_profit_change.append(...)
    
monthly_profit_change

TypeError: 'ellipsis' object is not iterable

# <font id="2" color="blue">Numpy</font>

<tr><td>
<img src="http://rickizzo.com/images/posts/2017-12-19/numpy.jpeg"/></td>

<td style="text-align:left">Numpy's main use is ```np.array```
<br><br>
Numpy arrays take less space than built-in lists and come with a **wide variety of useful functions.**</td></tr>

In [3]:
# make an array
a = np.array([2,3,4])
a

array([2, 3, 4])

In [4]:
# make a 2-dimensional array (matrix)
matrix = np.array([ [1,2,3],
                    [4,5,6],
                    [7,8,9] ])
matrix

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

Linear Algebra!

In [5]:
# you can multiply matrices with np.dot
np.dot(matrix, a)

array([20, 47, 74])

### Arithmetic with numpy!

**You can add/subtract/multiply/divide with numpy arrays, element by element!** Built-in Python lists do *not* support this kind of element-wise arithmetic.

In [6]:
a + 5

array([7, 8, 9])

In [7]:
a * -1

array([-2, -3, -4])

In [8]:
b = np.array([3, 2, 1])
a + b

array([5, 5, 5])
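
For contrast, here is a short sketch of how the same operators behave on a built-in list. The names `plain_list` and `demo_array` are hypothetical and chosen so they do not overwrite `a` or `b`.

```python
import numpy as np

plain_list = [2, 3, 4]
demo_array = np.array(plain_list)

print(plain_list * 2)    # repeats the list
print(demo_array * 2)    # multiplies each element

try:
    plain_list + 5       # a list cannot be added to a number
except TypeError as err:
    print("TypeError:", err)
```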

If you try to perform operations on two arrays of different lengths, <span style="color:red">an error will occur.</span> Try running the following cell!

In [9]:
# Run me!
b + np.array([1, 2, 3, 4])

ValueError: operands could not be broadcast together with shapes (3,) (4,) 
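
One way to handle this situation, sketched here with hypothetical names `shorter` and `longer`, is to inspect the `.shape` of each array first or to catch the `ValueError`:

```python
import numpy as np

shorter = np.array([3, 2, 1])
longer = np.array([1, 2, 3, 4])

# .shape reports the length along each dimension
print(shorter.shape, longer.shape)

try:
    result = shorter + longer
except ValueError as err:
    result = None
    print("Could not add:", err)
```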

Use ```len( array )``` to find length of array.

In [None]:
len(b)

In [None]:
len(np.array([1, 2, 3, 4]))

Conditionals apply to every element of a numpy array as well. This will come in handy later!

In [None]:
a = np.array([1, 2, 3, 1, 1])
a == 1
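
As a preview of why this is handy, the sketch below (using a hypothetical array named `scores`) counts the matches and keeps only the matching elements:

```python
import numpy as np

scores = np.array([1, 2, 3, 1, 1])
is_one = scores == 1          # one True/False value per element

# True counts as 1 and False as 0, so the sum is the number of matches
match_count = is_one.sum()
print(match_count)

# Indexing with the boolean array keeps only the True positions
print(scores[is_one])
```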

### Essential array functions

Why do we use Numpy? **Numpy provides a multitude of useful functions for arrays.** We'll teach you a few (many more exist!)

<font color="blue">Exercise:</font> Search online how to find the mean of a numpy array.

In [None]:
x = np.array([1, 5, -7, 18, 1, -2, 4])

In [None]:
# Find the mean of array x
x_mean = np.mean(...)

Here, we'll give you a list of some useful numpy functions. Remember, you can easily find info about these by searching google / numpy documentation!

In [None]:
np.sum(x)

In [None]:
np.min(x)

In [None]:
np.max(x)

In [None]:
np.median(x)

In [None]:
np.cumsum(x)

In [None]:
np.abs(x)

What do you think ```np.cumsum``` does? Note, numpy has a similar function ```np.cumprod```. Try it!

What do you think ```np.diff``` does?

In [None]:
np.diff(x)

Two super useful functions in numpy are `np.arange` and `np.linspace`. They allow you to craft arrays with equidistant values:
* `np.arange` takes an optional `start`, a `stop`, and an optional `step`
* `np.linspace` takes a `start`, a `stop`, and `num` (the number of values)

In [None]:
np.arange(0, 100, 10)

In [None]:
np.linspace(0, 100, 15)
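
One difference worth noticing: `np.arange` stops *before* `stop`, while `np.linspace` includes `stop` by default. A small sketch with hypothetical names `stepped` and `spaced`:

```python
import numpy as np

stepped = np.arange(0, 10, 2)     # step of 2, stop (10) is excluded
spaced = np.linspace(0, 10, 5)    # 5 evenly spaced values, stop (10) is included

print(stepped)
print(spaced)
```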

### Python

Using ```np.array``` objects in Python is a little different from using built-in lists.

In [None]:
a = np.array([2, 3, 4])
b = [2, 3, 4]
print(a)
print(b)

#### Adding values to np.array is different

In [None]:
b.append("hello")
b

In [None]:
a = np.append(a, 'hello')
a
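
The reassignment `a = np.append(...)` matters: unlike `list.append`, `np.append` does not modify the array in place but returns a new one. A minimal sketch with hypothetical names `original` and `extended`:

```python
import numpy as np

original = np.array([2, 3, 4])
extended = np.append(original, 5)   # returns a new array

print(original)   # the original array is unchanged
print(extended)
```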

#### For loops work the same way

In [None]:
c = np.array([1, 2, 3, 4, 5])
cumulative_product = 1

for element in c:
    cumulative_product *= element
    
cumulative_product

### <font color="blue">Numpy Exercises</font>

Use [`np.arange`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.arange.html) to create an array called `arr1` that contains every odd number from 1 to 100, inclusive. Write your code in the `...` sections

In [None]:
arr1 = np.arange(...)
arr1

Use `arr1` to create an array `arr2` of every number divisible by 4 from 1 to 200, inclusive.

In [None]:
arr2 = ...
arr2

Create the same array, but using [`np.linspace`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.linspace.html) instead. Call this array `arr3`.

In [None]:
arr3 = np.linspace(...)
arr3

Print the following summary statistics for `arr3`: 

* minimum
* 1st quartile (Hint: See [`np.percentile()`](https://docs.scipy.org/doc/numpy-dev/reference/generated/numpy.percentile.html))
* median
* mean
* standard deviation
* 3rd quartile
* max


In [None]:
print('Minimum: '            + str(...))
print('1st quartile: '       + str(...))
print('Median: '             + str(...))
print('Mean: '               + str(...))
print('Standard Deviation: ' + str(...))
print('3rd Quartile: '       + str(...))
print('Max: '                + str(...))

# <span id="2.5" style="color: blue">Pandas</span>

<tr><td><img width=200 src="https://c402277.ssl.cf1.rackcdn.com/photos/13100/images/featured_story/BIC_128.png?1485963152"/></td><td>

Pandas is all about tables!</td></tr>

A table is called a ['dataframe'](http://pandas.pydata.org/pandas-docs/stable/dsintro.html#dataframe) in Pandas. Consider the table `fruit_info`:



<table border="1" class="dataframe">
  <thead><tr><td><b>color</b></td><td><b>fruit</b></td></tr></thead>
<tr><td>red</td><td>apple</td></tr>
<tr><td>orange</td><td>orange</td></tr>
<tr><td>yellow</td><td>banana</td></tr>
<tr><td>pink</td><td>raspberry</td></tr>
</table>

## Pandas Series

Let's break this table down. DataFrames consist of columns called **```Series```**. Series act like numpy arrays.

_How to make a Series:_

1.   create a numpy ```array```
2.   call ```pd.Series(array, name="...")``` &nbsp;&nbsp; <font color="gray"># name can be anything</font>
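
These two steps can be sketched as follows, using made-up data unrelated to `fruit_info` (the names `heights` and `height_column` are hypothetical):

```python
import numpy as np
import pandas as pd

# Step 1: create a numpy array (hypothetical values)
heights = np.array([150, 162, 171])

# Step 2: wrap it in a Series and give it a name
height_column = pd.Series(heights, name="height_cm")

print(height_column.name)
print(height_column.mean())   # Series support many numpy-style operations
```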

<font color="blue">Exercise:</font> Make a Series that contains the colors from `fruit_info` and has `name='color'`


In [None]:
array = np.array(...)
color_column = pd.Series(...)
color_column

<font color="blue">Exercise:</font> Make another Series for the fruit column:

In [None]:
array = np.array(...)
fruit_column = pd.Series(...)
fruit_column

Combine your Series into a table!

`pd.concat([ series1, series2, series3, ... ], axis=1)`

Don't forget ```axis=1``` or you'll just make a giant Series.

In [None]:
fruit_info = pd.concat([color_column, fruit_column], axis=1)
fruit_info
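
To see what the axis argument changes, here is a sketch with two hypothetical Series. It passes the axis by keyword (`axis=1`), which recent pandas versions require.

```python
import pandas as pd

# Hypothetical Series, unrelated to fruit_info
letters = pd.Series(["a", "b", "c"], name="letter")
words = pd.Series(["ant", "bee", "cat"], name="word")

side_by_side = pd.concat([letters, words], axis=1)   # columns next to each other
stacked = pd.concat([letters, words])                # one long Series

print(side_by_side.shape)
print(stacked.shape)
```

With `axis=1` the result is a DataFrame whose column labels come from each Series' `name`.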

What if we were given the DataFrame and we want to extract the columns?

In [None]:
fruit_info['fruit'] # we get the fruit_column Series back!

### Add Columns

Add a column to `table` labeled "new column" like so:

`table['new column'] = array`

In [None]:
fruit_info['inventory'] = np.array([23, 18, 50, 20])
fruit_info

<font color="blue">Exercise:</font> Add a column called ```rating``` that assigns your rating from 1 to 5 for each fruit :) 

In [None]:
fruit_info['rating'] = ...

fruit_info  # should now include a rating column

### Drop

<font color="blue">Exercise:</font> Now, use the `.drop()` method to [drop](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop.html) the `color` column.

In [None]:
fruit_info_without_color = ... # must include axis=1

fruit_info_without_color

## California Baby Names

Time to use a real dataset!

You can read a `.csv` file into pandas using `pd.read_csv( url )`.

Create a variable called `baby_names` that loads this data: `https://raw.githubusercontent.com/carlocrza/Data_Science_Society/master/ca_baby_names.csv`



In [12]:
baby_names = pd.read_csv("https://raw.githubusercontent.com/carlocrza/Data_Science_Society/master/baby_names.csv")

Let's display the table. We can just type `baby_names` and run the cell but baby_names is HUGE! So, let's display just the first five rows with:

`DataFrame.head( # of rows )`

In [None]:
baby_names.head(5)

## Row, Column Selection

Follow the structure:

`table.loc[rows, columns]`

`table.loc[2:8, [ 'Name', 'Count']]`

The above code will select columns "Name" and "Count" from rows 2 **through** 8.
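
The word **through** is important: `.loc` slices by row *label* and includes both endpoints, unlike ordinary Python slicing. The sketch below uses a hypothetical table `pets` and, for comparison only, `.iloc`, a position-based selector that this notebook does not otherwise cover.

```python
import pandas as pd

# Hypothetical table with the default row labels 0, 1, 2, ...
pets = pd.DataFrame({
    "Name": ["Ada", "Bo", "Cy", "Di", "Ed"],
    "Count": [5, 3, 8, 1, 4],
})

by_label = pets.loc[1:3, ["Name"]]   # rows labeled 1, 2, and 3
by_position = pets.iloc[1:3]         # rows at positions 1 and 2 only

print(len(by_label), len(by_position))
```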

In [None]:
# Returns the name of our columns
baby_names.columns

In [None]:
baby_names.loc[2:8, ['Name', "Count"]]

<font color="blue">Exercise:</font> Return a table that includes rows 1000-1005 and only includes the column "Name".

In [None]:
baby_names.loc[...]

In [None]:
# Want to select EVERY row?
# Don't put anything before and after the colon :
baby_names.loc[:, ['Sex', 'Name']].head(4)

### Selecting an entire Column

Remember we can extract the column in the form of a **Series** using:

`table_name['Name of column']`

In [None]:
name_column = baby_names['Name']
name_column.head(5) # we can also use .head with Series!

### Selecting rows with a Boolean Array

Lastly, we can select rows based off of True / False data. Let's go back to the simpler `fruit_info` table.

In [None]:
fruit_info

In [None]:
# select row only if corresponding value in *selection* is True
selection = np.array([True, False, True, False])
fruit_info[selection]

## Filtering Data

So far we have selected data based off of row numbers and column headers. Let's work on filtering data more precisely.

`table[condition]`

In [None]:
condition = baby_names['Name'] == 'Shalini'
baby_names[condition].head(5)

The above code only selects rows that have Name equal to 'Shalini'. Change it to your name!

### Apply multiple conditions!

 `table[ (condition 1)  &  (condition 2) ]`
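
Here is a minimal sketch of combining two conditions, using a hypothetical table `orders` rather than `baby_names`. Each condition is a boolean Series, and `&` keeps rows where both are `True`.

```python
import pandas as pd

# Hypothetical data, only for illustration
orders = pd.DataFrame({
    "Item": ["pen", "ink", "pad", "pen"],
    "Year": [2019, 2020, 2020, 2020],
    "Count": [40, 15, 60, 75],
})

in_2020 = orders["Year"] == 2020
large = orders["Count"] > 50
selected = orders[in_2020 & large]

# Written inline, each comparison needs its own parentheses,
# because & binds more tightly than == and >
selected_inline = orders[(orders["Year"] == 2020) & (orders["Count"] > 50)]

print(selected)
```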
 

 
<font color="blue">Exercise:</font> Select the names from the year 2000 whose counts are greater than 3000.

In [None]:
result = ...
result.head(3)

### Thorough explanation:

Remember that calling `baby_names['Name']` returns a **Series** of all of the names.

Checking if values in the series are equal to `Shalini` results in an array of {True, False} values. 

Then, we select rows based off of this boolean array. Thus, we could also do:

In [None]:
names = baby_names['Name']
equalto_Shalini = (names == 'Shalini')  # equalto_Shalini is now an array of True/False variables!
baby_names[equalto_Shalini].head(5)

## Using Numpy with Pandas

How many rows does our `baby_names` table have?

In [None]:
len(baby_names)

That's a lot of rows! We can't just look at the table and understand it.

Luckily, **Numpy** functions treat pandas **Series** as np.arrays.
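
For instance, with a small hypothetical Series named `weights`:

```python
import numpy as np
import pandas as pd

# Hypothetical values, unrelated to baby_names
weights = pd.Series([3.2, 4.1, 2.9, 3.8], name="weight_kg")

largest = np.max(weights)
average = np.mean(weights)

print(largest, average)
```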

<font color="blue">Exercise:</font> What are the oldest and most recent years that we have data from in `baby_names`?
HINT: np.min, np.max

In [None]:
recent_year = ...
oldest_year = ...
(recent_year, oldest_year)

<font color="blue">Exercise:</font> How many babies were born in CA in 2015?

Hint: the 'Count' column refers to the number of occurrences of a baby name. How could we find the total number of babies across all names? Now narrow that to only 2015.

In [None]:
baby_names_2015 = ...
baby_names_2015_counts = ...
number_baby_names_2015 = np.sum(...)
number_baby_names_2015

In [None]:
# Or to do it all in one operation:
...

## Group By with Pandas

In the previous section we calculated the number of baby names registered in 2015.

In [None]:
np.sum(baby_names[baby_names['Year'] == 2015]['Count'])

There are 107 years in the data, though. If we wanted to know how many babies were born in California in each year, repeating this filter 107 times would be tedious, so we need something more efficient.

`groupby` to the rescue!

Groupby allows us to split our table into groups, each group having one similarity.

For example if we group by "Year" we would create 107 groups because there are 107 unique years.


<center>`baby_names.groupby('Year')`</center>


Now we have 107 groups, but what do we do with them? We can apply the function `sum` to each group. This sums the numerical column 'Count' within each group, which reduces each group to a single row: the year and its total.

Excellent tutorial: http://bconnelly.net/2013/10/summarizing-data-in-python-with-pandas/
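
As a small sketch of the split-then-sum idea, here is a hypothetical table `sales` with only five rows. Selecting the `"Count"` column before calling `sum` is one way to make sure only that column is aggregated.

```python
import pandas as pd

# Hypothetical data, only for illustration
sales = pd.DataFrame({
    "Year": [2019, 2019, 2020, 2020, 2020],
    "Item": ["pen", "pad", "pen", "pad", "ink"],
    "Count": [10, 5, 7, 3, 2],
})

# One group per unique year, then sum Count within each group
totals = sales.groupby("Year")["Count"].sum()
print(totals)
```

The result is a Series indexed by `Year`, with one total per group.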

In [0]:
# this will apply sum to the "Count" column of each year group
# numeric_only=True skips text columns such as "Name"
yearly_data = baby_names.groupby('Year').sum(numeric_only=True)
yearly_data.head(5)

In [3]:
baby_names.head()  # here is what the baby_names DataFrame looks like again, for reference

How many female baby names are there? (Hint: use group by plus another aggregate function)

In [17]:
female_baby_names = ...
female_baby_names

How many female and male baby names are there per year? (Hint: use two aggregate functions this time!)

In [1]:
female_and_male_baby_names_per_year = ...
female_and_male_baby_names_per_year

Ellipsis

What is the average number of names per year?

In [16]:
average_names_per_year = ...
average_names_per_year

# <font color="blue" id="3">Matplotlib</font>


## Line Graphs
Use [`plt.plot()`](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.plot.html) to create line graphs! The required arguments are a list of x-values and a list of y-values.

In [0]:
np.random.seed(42) # To ensure that the random number generation is always the same
plt.plot(np.arange(0, 7, 1), np.random.rand(7, 1))
plt.show()

In [0]:
%matplotlib inline

plt.plot(np.arange(0, 7, 1), np.random.rand(7, 1))
# plt.show() no longer required

## Histograms
To explore other types of charts, let's load in a built-in dataset from Seaborn and first take a quick peek:

In [0]:
tips = sns.load_dataset('tips')
tips.head()

Histograms can be plotted in matplotlib using [`plt.hist()`](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.hist.html).
This will take one required argument of the x-axis variable.

In [0]:
plt.hist(tips['total_bill'])

## Scatterplots
Scatterplots can be made using [`plt.scatter()`](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.scatter.html). It takes in two arguments: x-values and y-values.

In [0]:
plt.scatter(tips['total_bill'], tips['tip'])

In [0]:
plt.scatter(tips['total_bill'], tips['tip'])
plt.xlabel('Total Bill')
plt.ylabel('Tip Amount')
plt.title('Total Bill vs Tip Amount')

In [0]:
plt.figure(figsize=(15, 10)) # Increase the size of the returned plot

# Points with smoker == 'Yes'
plt.scatter(x=tips.loc[tips['smoker'] == 'Yes', 'total_bill'], 
            y=tips.loc[tips['smoker'] == 'Yes', 'tip'],
            label='Smoker', alpha=0.6)

# Points with smoker == 'No'
plt.scatter(x=tips.loc[tips['smoker'] == 'No', 'total_bill'], 
            y=tips.loc[tips['smoker'] == 'No', 'tip'],
            label='Non-Smoker', alpha=0.6)

plt.xlabel('Total Bill')
plt.ylabel('Tip Amount')
plt.title('Total Bill vs Tip Amount (by Smoking Habits)')
plt.legend()

## Exercises in Matplotlib
We'll do the exercises using a famous dataset: [the iris dataset](https://archive.ics.uci.edu/ml/datasets/iris).
First, let's load it in and take a look:

In [0]:
iris = sns.load_dataset('iris')
iris.head()

![alt text](https://www.wpclipart.com/plants/diagrams/plant_parts/petal_sepal_label.png)

Let's also take a look at the different species:

In [0]:
iris['species'].unique()

<font color="blue">Exercise:</font> Create a basic scatterplot of the petal lengths versus the petal widths. Label your axes (use the documentation linked above to make them meaningful)!

In [0]:
...

<font color="blue">Exercise:</font> This time, create the same scatterplot, but assign a different color for each flower species.

In [0]:
plt.scatter(...)
plt.scatter(...)
plt.scatter(...)
plt.legend()

In [0]:
def plot_by_species(species, x, y):
    plt.scatter(x=iris.loc[iris['species'] == species, x],
                y=iris.loc[iris['species'] == species, y],
                label=species)

for species in iris['species'].unique():
    plot_by_species(species, 'sepal_length', 'sepal_width')

plt.xlabel('Sepal Length (cm)')
plt.ylabel('Sepal Width (cm)')
plt.title('Sepal Length vs Sepal Width (by Species)')
plt.legend()

That's the end of our workshop!

We hope you learned something. Keep this notebook handy for reference later!

<img width="120" src="https://dss.berkeley.edu/static/img/logo.jpg"/>