# Numpy and Pandas 

## Goals

- Introduce the numpy and pandas libraries.

Learning objectives:

      - Numpy arrays and their mathematical abilities
      
      - Importing data into pandas using CSVs
      
      - Slicing and filtering pandas dataframes
      
      - Cleaning data
      
      - Statistics and other math with pandas



- Numpy and pandas are two common Python libraries used for statistical analysis, data munging/wrangling/transformation, and other mathematical purposes.

- To put it simply, they are your connection to the data. Pandas is the most important tool because you'll spend the most time and effort with it.


### Numpy

Numpy has a wide ecosystem of functions and uses, but for the purpose of this course we will focus on arrays, which are numpy's version of a list.

In [None]:
# Import library
import numpy as np

In [None]:
#Let's turn a list into an array

l = [3,2,6,7,9,1,2,-5]

array = np.array(l)

In [None]:
#Call it
array

In [None]:
#Call type
type(array)

How arrays differ from lists

In [None]:
#Does this code work

l + 3

In [None]:
#What about this?
array + 3

In [None]:
#Multiply l by 2 
l * 2

In [None]:
#Multiply array by 2 
array * 2

Numpy arrays have mathematical abilities that lists don't have, which makes them easier to use for numerical work.
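As a compact recap of the cells above, here is a small sketch of the difference. The numbers are made up for illustration and the variable names are not part of the lesson.

```python
import numpy as np

values = [3, 2, 6]          # a plain Python list (made-up numbers)
arr = np.array(values)      # the same numbers as a numpy array

# Lists: * repeats the list, and + with a number raises TypeError
repeated = values * 2       # [3, 2, 6, 3, 2, 6]
try:
    values + 3
    list_add_worked = True
except TypeError:
    list_add_worked = False

# Arrays: arithmetic is applied to every element
doubled = arr * 2           # array([ 6,  4, 12])
shifted = arr + 3           # array([6, 5, 9])

print(repeated, list_add_worked, doubled, shifted)
```

This is why `l + 3` fails while `array + 3` works, and why `l * 2` and `array * 2` give such different results.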

In [None]:
#Mean value
array.mean()

In [None]:
#Maximum value
array.max()

In [None]:
#Minimum value
array.min()

In [None]:
#Sum all values
array.sum()

In [None]:
#Find standard deviation
array.std()

In [None]:
#What happens when you do this
dir(array)

Can also use numpy itself to call certain functions

In [None]:
#Median
np.median(array)

In [None]:
#Square
np.square(array)

In [None]:
#Square root (the negative value in array comes back as nan, with a warning)
np.sqrt(array)

In [None]:
# Absolute value (arrays have no .abs() method, so call it from numpy)
np.abs(array)

Arrays can also be multi-dimensional

In [None]:
#Make a two-dimensional numpy array with the arange and reshape functions

np.arange(16)

In [None]:
arr_2d = np.arange(16).reshape(4,4)
arr_2d

<b>Slicing a two-dimensional array</b>

In [None]:
#Slice rows
arr_2d[:3]

In [None]:
#Slice columns 
arr_2d[:, 1:]

In [None]:
#Slice both rows and columns
arr_2d[2: , :3]

In [None]:
#Slice specific value
arr_2d[1,2]
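The slices above can be checked against a small annotated sketch. It rebuilds the same 4x4 array under a different, illustrative name (`grid`) so it can run on its own.

```python
import numpy as np

grid = np.arange(16).reshape(4, 4)   # rows: 0-3, 4-7, 8-11, 12-15

first_three_rows = grid[:3]          # rows 0, 1, 2; every column
without_first_col = grid[:, 1:]      # every row; columns 1, 2, 3
bottom_left = grid[2:, :3]           # rows 2, 3; columns 0, 1, 2
single_value = grid[1, 2]            # row 1, column 2

print(first_three_rows.shape)    # (3, 4)
print(without_first_col.shape)   # (4, 3)
print(bottom_left.tolist())      # [[8, 9, 10], [12, 13, 14]]
print(single_value)              # 6
```

The part before the comma always selects rows and the part after it selects columns.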

Fantastic numpy tutorial here: https://www.datacamp.com/community/tutorials/python-numpy-tutorial

## Pandas

From <u>[Mastering Pandas](https://www.packtpub.com/big-data-and-business-intelligence/mastering-pandas)</u>

    The pandas is a high-performance open source library for data analysis in Python developed by Wes McKinney in 2008. Over the years, it has become the de-facto standard library for data analysis using Python. There's been great adoption of the tool, a large community behind it, (220+ contributors and 9000+ commits by 03/2014), rapid iteration, features, and enhancements continuously made.
    
    • It can process a variety of data sets in different formats: time series, tabular heterogeneous, and matrix data.
    • It facilitates loading/importing data from varied sources such as CSV and DB/SQL.
    • It can handle a myriad of operations on data sets: subsetting, slicing, filtering, merging, groupBy, re-ordering, and re-shaping.
    • It can deal with missing data according to rules defined by the user/developer: ignore, convert to 0, and so on.
    • It can be used for parsing and munging (conversion) of data as well as modeling and statistical analysis.
    • It integrates well with other Python libraries such as statsmodels, SciPy, and scikit-learn.



In [None]:
#Import pandas library
import pandas as pd

In [None]:
#Create a pandas series.
series = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])
series

In [None]:
#Return row
series["a"]

In [None]:
#Change value in the series
series["d"] = 7
series

In [None]:
#Turn python dictionary into pandas series
population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}
population = pd.Series(population_dict)
population

In [None]:
#Call California
population['California']

In [None]:
#Turn python dictionary into pandas data frame

data = {"feature_one" :[1,2,4,8,-3],
       "feature_two" : ["haight", "mission", "geary", "castro", " potrero"],
       "feature_three": [True, True, False, True, False]}
df = pd.DataFrame(data, index=[6,7,8,9,10])

df

In [None]:
#Returns columns
df.columns

In [None]:
#Returns index
df.index

In [None]:
#Returns numpy array version of data frame. Also works on series.
df.values

In [None]:
#Call type on df
type(df)

In [None]:
#Call feature_one column
f1 = df["feature_one"]
f1

In [None]:
#Call type on f1
type(f1)

In [None]:
#Select columns
cols = ["feature_one", "feature_two"]
df[cols]

In [None]:
#Add new column to dataset

#Create new column feature_four by assigning it to number 4
df["feature_four"] = 4

#Create new column feature_five by assigning it to list ds
ds = ["Data", "Science", "Math", "Programming", "Hacking"]
df["feature_five"] = ds
df

The first dataset we will work with is the drinks dataset.

In [None]:
#File location of drinks dataset
path = "../data/drinks.csv"

drinks = pd.read_csv(path)

In [None]:
#Take a look at the data

drinks.head()

In [None]:
#Let's designate the country column as the index

drinks.set_index("country", inplace=True)

In [None]:
#Head is used to view first 5 rows. 5 is default but can be changed.
drinks.head()

In [None]:
#Tail is for last five rows
drinks.tail()

In [None]:
#How many rows and columns are there in this dataset?
drinks.shape

General dataset information

In [None]:
#Let's look at some details of this dataset
drinks.info()

In [None]:
drinks.describe()

In [None]:
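#numeric_only=True skips the text column continent, which recent pandas versions refuse to correlate
drinks.corr(numeric_only=True)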

What do these two commands do?

## Slicing dataframes

.loc

In [None]:
# Select values in Peru row
drinks.loc["Peru"]

In [None]:
#Select values in wine_servings column
drinks.loc[:, "wine_servings"].head()

In [None]:
#Slice countries and columns
drinks.loc["Germany": "Guyana", "beer_servings":"wine_servings"]

.iloc

In [None]:
#What do you think iloc does???

In [None]:
#Returns row at index 48
drinks.iloc[48]

In [None]:
#Returns column at index 1
drinks.iloc[:,1].head()

In [None]:
#Return slice of rows and columns
drinks.iloc[120:137, 1:3]

iloc slices dataframes using the integer index.
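To make the contrast with `.loc` concrete, here is a minimal sketch on a tiny made-up data frame. The values are invented and are not taken from drinks.csv.

```python
import pandas as pd

# Made-up numbers, not taken from drinks.csv
toy = pd.DataFrame(
    {"beer": [10, 20, 30], "wine": [1, 2, 3]},
    index=["Peru", "Chile", "Brazil"],
)

by_label = toy.loc["Chile"]            # row whose index label is "Chile"
by_position = toy.iloc[1]              # row at integer position 1 (the same row)

label_slice = toy.loc["Peru":"Chile"]  # label slices include the end label
position_slice = toy.iloc[0:1]         # position slices exclude the end position

print(by_label.equals(by_position))            # True
print(len(label_slice), len(position_slice))   # 2 1
```

Note the difference at the end of a slice: `.loc` keeps the last label, while `.iloc` stops before the last position, just like a list slice.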

<b>Conditional selection</b>

In [None]:
drinks.continent == "EU"

In [None]:
drinks.wine_servings > 20

Take those commands and pass them into the drinks data frame

In [None]:
drinks[drinks.continent == "EU"]

In [None]:
#Non european countries
drinks[drinks.continent != "EU"]

In [None]:
#Rows where wine_servings greater than 20
drinks[drinks.wine_servings > 20]

`drinks.continent=='EU'` by itself returns a bunch of Trues and Falses.

When you wrap drinks around it with square brackets you're telling the drinks dataframe to select only those that are True, and not the False ones.
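The same idea on a three-row made-up data frame, where the mask is small enough to read in full. The countries and numbers here are illustrative only.

```python
import pandas as pd

toy = pd.DataFrame(
    {"continent": ["EU", "AS", "EU"], "wine": [100, 5, 30]},
    index=["France", "Japan", "Sweden"],
)

mask = toy.continent == "EU"   # boolean Series: True, False, True
european = toy[mask]           # keeps only the rows where mask is True

print(mask.tolist())             # [True, False, True]
print(european.index.tolist())   # ['France', 'Sweden']
print(mask.sum())                # 2, because each True counts as 1
```

Storing the mask in a variable first can make longer filters easier to read.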


In [None]:
#Return a data frame where both conditions are true
drinks[(drinks.continent == "EU") & (drinks.wine_servings > 20)]

In [None]:
#Return data frame where either condition is true
drinks[(drinks.continent == "EU") | (drinks.wine_servings > 20)]

In [None]:
#Return rows where wine_servings is greater than beer_servings
drinks[drinks.wine_servings > drinks.beer_servings]

In [None]:
#Call index to return just the countries
drinks[drinks.wine_servings > drinks.beer_servings].index

We can sum boolean values.

In [None]:
#How many countries consume no beer at all?
(drinks.beer_servings == 0).sum()

<b>Pandas Series</b>

In [None]:
#Assign beer_servings to variable beer
beer = drinks.beer_servings
beer.head()

In [None]:
#Can do math operations similar to numpy arrays
#Multiply every value in beer by 2
beer*2

In [None]:
#Add 2 to every value in beer
beer + 2

In [None]:
#Derive mean of beer
beer.mean()

In [None]:
#Derive median of beer
beer.median()

In [None]:
#Sum all values in beer
beer.sum()

In [None]:
#Pandas series can be added to one another
wine = drinks.wine_servings
beer + wine

In [None]:
#Create a new column called total_servings that is the sum of the beer, wine, and spirits columns
drinks["total_servings"] = beer + wine + drinks.spirit_servings
drinks.head()

In [None]:
#Let's take a look at continent
cont = drinks.continent

In [None]:
#How many null values are there in continent
#First check to see which values are null
cont.isnull()

In [None]:
#Replace every null with "No Continent" (assign back so the drinks data frame is updated too)
drinks["continent"] = cont.fillna("No Continent")
cont = drinks.continent

`.fillna()` is also great for replacing all the null values in a numerical column with, for example, the mean of that column.
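A minimal sketch of filling with the mean, using a made-up series rather than a column from the drinks data:

```python
import numpy as np
import pandas as pd

scores = pd.Series([2.0, np.nan, 4.0, np.nan, 6.0])  # made-up values

mean_score = scores.mean()           # mean skips nulls: (2 + 4 + 6) / 3 = 4.0
filled = scores.fillna(mean_score)   # returns a new Series; scores is unchanged

print(filled.tolist())          # [2.0, 4.0, 4.0, 4.0, 6.0]
print(scores.isnull().sum())    # 2
```

Because `.fillna()` returns a new series by default, assign the result back to a variable or column if you want to keep it.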

In [None]:
#Drop every null value in cont
#cont.dropna(inplace=True)

`.isnull()`, `.fillna()`, and `.dropna()` work with data frames as well

In [None]:
#What are the continents in cont?
cont.unique()

In [None]:
#How many unique values there are
cont.nunique()

In [None]:
#How many countries are from each continent?
cont.value_counts()

In [None]:
#What percentage of the data belongs to each continent
cont.value_counts(normalize=True)

Let's go back to the drinks data frame.

In [None]:
drinks.columns

In [None]:
#Show me the top 5 booziest countries
drinks.sort_values(by="total_servings").head().index

Does that seem right?

We're forgetting something

In [None]:
drinks.sort_values(by="total_servings", ascending=False).head().index

In [None]:
#This also works (kinda)
drinks.sort_values(by="total_servings").tail().index

In [None]:
#Sort values in a series
beer.sort_values()
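The sorting behaviour above can be seen on a tiny made-up series. The labels and totals below are invented for illustration.

```python
import pandas as pd

totals = pd.Series({"A": 5, "B": 50, "C": 20})  # made-up totals

lowest_first = totals.sort_values()                   # ascending is the default
highest_first = totals.sort_values(ascending=False)

print(lowest_first.index.tolist())           # ['A', 'C', 'B']
print(highest_first.index.tolist())          # ['B', 'C', 'A']
print(lowest_first.tail(2).index.tolist())   # ['C', 'B']
```

This is why `.tail()` on an ascending sort only "kinda" works: it returns the right rows, but with the largest value last instead of first.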

### Exercise time

1. Which countries drink more spirits than beer?

2. Find the top five booziest countries of each continent. Do not include "No Continent". Try using a dictionary and a for loop.

In [None]:
#Answer 1.

drinks[drinks.spirit_servings > drinks.beer_servings].index

In [None]:
#Answer 2.

drinks_dict = {}

for i in drinks.continent.unique():
    if i != "No Continent":
        value = drinks[drinks.continent == i].sort_values(by = "total_servings",
                                                        ascending=False).index.tolist()[:5]
        drinks_dict[i] = value
drinks_dict

## Groupby

**Split Apply Combine**

<img src="https://www.safaribooksonline.com/library/view/learning-pandas/9781783985128/graphics/5128OS_09_01.jpg">
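To show what split-apply-combine means in practice, here is a rough sketch that does the three steps by hand on a made-up data frame and then compares the result with `groupby`. The data and variable names are illustrative assumptions, not part of the drinks dataset.

```python
import pandas as pd

toy = pd.DataFrame({
    "continent": ["EU", "EU", "AS", "AS"],
    "beer": [100, 200, 10, 30],
})

# By hand: split into groups, apply mean to each, combine into one Series
manual = {}
for name in toy.continent.unique():
    group = toy[toy.continent == name]      # split
    manual[name] = group.beer.mean()        # apply
manual = pd.Series(manual).sort_index()     # combine

# groupby does all three steps in one call
grouped = toy.groupby("continent").beer.mean()

print(grouped.to_dict())   # {'AS': 20.0, 'EU': 150.0}
```

The hand-written loop and the `groupby` call give the same numbers; `groupby` is simply shorter and faster.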

In [None]:
#Group by continent
drinks.groupby("continent")

In [None]:
#Data needs to be accessed by certain methods
#Call .mean()
drinks.groupby("continent").mean()

In [None]:
#Call .median()
drinks.groupby("continent").median()

In [None]:
#What happens when you do .describe()
drinks.groupby("continent").describe()

In [None]:
#Call specific column on groupby object
drinks.groupby("continent").total_servings.min()

In [None]:
drinks.groupby("continent").total_servings.max()

End of drinks data. Any questions before we move on?

In [None]:
#Read in chipotle dataset

path = "../data/chipotle.tsv"
#Use read_table instead of read_csv because the file is tab-separated
chip = pd.read_table(path)
chip

We can see that there are some NaN present in the data set. Let's look at how many there are in each column.

In [None]:
#Call .isnull() and then call .sum()
chip.isnull().sum()

In [None]:
#What happens when you tack on another .sum()?
chip.isnull().sum().sum()

We need to fix the price column before converting it to a float

In [None]:
#What is the item_price type?
chip.item_price.dtype

Pandas series have a string accessor (`.str`) that lets you apply string methods to every value in a column.

In [None]:
#Call .str
chip.item_price.str

In [None]:
#Replace $ with empty string and overwrite item_price column
#regex=False treats "$" as a literal character, not as the regex end-of-string anchor
chip["item_price"] = chip.item_price.str.replace("$", "", regex=False)

In [None]:
#Change the type of column from object to float and overwrite item_price column
chip["item_price"] = chip.item_price.astype(float)

In [None]:
chip.head()
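The price cleanup above can be tried on its own with a few made-up price strings. The values and the trailing spaces are assumptions for illustration, not rows copied from the Chipotle file.

```python
import pandas as pd

prices = pd.Series(["$2.39 ", "$16.98 ", "$10.98 "])  # made-up price strings

no_dollar = prices.str.replace("$", "", regex=False)  # treat "$" literally
as_float = no_dollar.str.strip().astype(float)        # remove spaces, then convert

print(as_float.tolist())           # [2.39, 16.98, 10.98]
print(prices.str.len().tolist())   # [6, 7, 7]
```

Each `.str` method returns a new series, so the calls can be chained one after another.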

More examples of using the `str` accessor

In [None]:
# chip.item_name.str.capitalize()

In [None]:
# chip.item_name.str.lower()

In [None]:
#chip.item_name.str.len()

We know how to drop rows with null values but how do we drop columns with null values?

In [None]:
#Drop columns with null values
# chip.dropna(axis=1)

Remember `chip.isnull().sum()`? 

In [None]:
#Set axis = 1 in .sum()
chip.isnull().sum(axis=1)

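Axis sets the direction a command works along: `axis=0` refers to rows and `axis=1` refers to columns. `axis=0` is the default. In `chip.isnull().sum(axis=1)` the nulls are added up across the columns, which gives one count per row.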
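A small sketch of how `axis` changes the result, using a made-up two-column data frame:

```python
import pandas as pd

toy = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

col_totals = toy.sum()          # axis=0 (default): one total per column
row_totals = toy.sum(axis=1)    # axis=1: one total per row

without_b = toy.drop("b", axis=1)   # drop the column labelled "b"
without_row0 = toy.drop(0)          # axis=0 (default): drop the row labelled 0

print(col_totals.tolist())          # [6, 60]
print(row_totals.tolist())          # [11, 22, 33]
print(without_b.columns.tolist())   # ['a']
print(without_row0.index.tolist())  # [1, 2]
```

For `.drop()`, the axis says where to look for the label: among the row labels (`axis=0`) or among the column names (`axis=1`).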

This is handy when it comes to dropping rows and columns

In [None]:
#Lets get rid of the item_name column

chip.drop("item_name")

Why didn't this work?

Forgot to do axis = 1 !!

Let's try it again

In [None]:
chip.drop("item_name", axis=1)

In [None]:
#Make it permanent
# chip.drop("item_name", axis = 1, inplace = True)

If you want to drop rows, pass an index label or a list of index labels.

In [None]:
chip.drop([0, 2, 4, 6, 8])

Move on from Chipotle to Movie metadata

In [None]:
#Load in movie_metadata dataset
path = "../data/movie_metadata.csv"

movies = pd.read_csv(path)
movies.head()

In [None]:
movies.shape

In [None]:
movies.columns

In [None]:
#Check out the info
movies.info()

In [None]:
#Replace nulls in budget with median of budget (assign back so the movies data frame is updated)
movies["budget"] = movies.budget.fillna(movies.budget.median())

In [None]:
#Drop nulls
movies.dropna(inplace=True)

In [None]:
movies.shape

Next, filter the data. Let's only look at movies that are rated G, PG, PG-13, or R.
We could do it this way: movies[(movies.content_rating=="G") etc...], with one condition per rating.

But let's not!

In [None]:
#Make a list called ratings of the four ratings we want to use
ratings = ["G", "PG", "PG-13", "R"]

In [None]:
#Now let's check to see if a value in content_rating "is in" ratings
movies.content_rating.isin(ratings)

In [None]:
#Return new data frame (copy it so new columns can be added later without warnings)
movies = movies[movies.content_rating.isin(ratings)].copy()

In [None]:
#check shape
movies.shape
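To see that `.isin()` gives the same mask as the long chain of comparisons, here is a sketch on a made-up series of ratings. The extra ratings "NC-17" and "Unrated" are invented examples of values to filter out.

```python
import pandas as pd

ratings_col = pd.Series(["G", "R", "NC-17", "PG", "Unrated", "PG-13"])
wanted = ["G", "PG", "PG-13", "R"]

# Long form: one comparison per rating, joined with |
long_mask = (
    (ratings_col == "G") | (ratings_col == "PG")
    | (ratings_col == "PG-13") | (ratings_col == "R")
)

# Short form: isin checks membership in the list
short_mask = ratings_col.isin(wanted)

print(short_mask.tolist())            # [True, True, False, True, False, True]
print(long_mask.equals(short_mask))   # True
```

The short form also scales better: adding a fifth rating means adding one item to the list rather than another comparison.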

## Mapping and applying functions

In [None]:
#Make a new column called profit by subtracting budget from gross
movies["profit"] = movies.gross - movies.budget
movies.profit.head()

We want to know if a movie is profitable. How can we go about doing that?
With some python!

In [None]:
#Make a function that returns "profitable" for profit > 0 and "loss" for profit <= 0
def profit_decider(x):
    if x> 0:
        return "profitable"
    else:
        return "loss"

In [None]:
#Apply the function onto the profit column
movies.profit.apply(profit_decider)

In [None]:
#Make new column called "profitable"
movies["profitable"] = movies.profit.apply(profit_decider)

In [None]:
#How many movies are profitable and not profitable
movies.profitable.value_counts()

Use lambda functions

In [None]:
#Return 1 for movies directed by Christopher Nolan
movies.director_name.apply(lambda x: 1 if x == "Christopher Nolan" else 0)

`.apply()` allows us to transform values based on rules set by a function.

`.map()` allows us to use a dictionary to directly change values

In [None]:
#Example of changing male to m and female to f (gender is a small made-up series)
gender = pd.Series(["male", "female", "female", "male"])
gender.map({"male":"m", "female":"f"})

A parent wants to make a new column called kid_friendly, where movies rated G or PG are labelled "KF" and movies rated PG-13 or R are labelled "NKF".

In [None]:
#Map a dictionary onto the content_rating column and create new column called "kid_friendly"
kf_dict = {"G" : "KF",
          "PG" : "KF",
          "PG-13" : "NKF",
          "R" : "NKF"}
movies["kid_friendly"] = movies.content_rating.map(kf_dict)

Remember that you must account for every unique value or you will get null values
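A short sketch of what happens when the dictionary misses a value. The series and the deliberately incomplete dictionary are made up for illustration, and filling the gaps afterwards is just one possible fix.

```python
import pandas as pd

ratings_col = pd.Series(["G", "PG", "R", "PG-13"])
partial = {"G": "KF", "PG": "KF"}     # deliberately leaves out PG-13 and R

mapped = ratings_col.map(partial)      # values missing from the dictionary become null
print(mapped.isnull().tolist())        # [False, False, True, True]

# One possible way to handle it: fill the gaps afterwards
filled = mapped.fillna("NKF")
print(filled.tolist())                 # ['KF', 'KF', 'NKF', 'NKF']
```

Checking `.isnull().sum()` right after a `.map()` call is a quick way to catch values you forgot to include.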

## Class work exercises

1) What percent of movies are directed by James Cameron?

In [None]:
movies.director_name.value_counts(normalize=True)["James Cameron"]*100

2) What are the correlations among budget, gross, and imdb_score?

In [None]:
cols = ["budget", "gross", "imdb_score"]
movies[cols].corr()

3) How many PG-13 movies has Robert De Niro starred in?

In [None]:
movies[(movies.content_rating == "PG-13") & (movies.actor_1_name == "Robert De Niro")].shape[0]

4) How much money have non-English films generated?

In [None]:
foreign_films = movies[movies.language != "English"]
foreign_films.gross.sum()

5) Who are the top five grossing directors on average?

In [None]:
movies.groupby("director_name")["gross"].mean().sort_values(ascending=False).head()

5b) Now only look at directors who have directed more than five films

In [None]:
director_value_counts = movies.director_name.value_counts()
directors_with_more_than_5_movies = director_value_counts[director_value_counts>5].index

In [None]:
movies_directors_5 = movies[movies.director_name.isin(directors_with_more_than_5_movies)]

In [None]:
movies_directors_5.shape

In [None]:
movies_directors_5.groupby("director_name")["gross"].mean().sort_values(ascending=False).head()

6) How many movies contain "Action" in the genre column? What about Comedy? What about both Comedy and Action?

In [None]:
movies.genres.apply(lambda x:"Action" in x).sum()

In [None]:
#Do it using str method
movies.genres.str.contains("Action").sum()

In [None]:
movies.genres.apply(lambda x:"Comedy" in x).sum()

In [None]:
movies.genres.str.contains("Comedy").sum()

In [None]:
movies.genres.apply(lambda x:"Comedy" in x and "Action" in x).sum()

In [None]:
((movies.genres.str.contains("Comedy")) & (movies.genres.str.contains("Action"))).sum()

If you finish the exercises before the end of class, then further investigate the movies dataset on your own or any of the other sets in the data folder.

Be sure to check out any of these pandas resources

- Pandas cheatsheets in the extracurricular directory

- https://chrisalbon.com/ <- Great website. For now check out data wrangling section.

- Data school's collection of resources http://www.dataschool.io/best-python-pandas-resources/

- Data school tutorial in giant repo. http://nbviewer.jupyter.org/github/justmarkham/pandas-videos/blob/master/pandas.ipynb

- http://www.gregreda.com/2013/10/26/intro-to-pandas-data-structures/
- http://nbviewer.jupyter.org/github/fonnesbeck/Bios8366/blob/master/notebooks/Section2_1-Introduction-to-Pandas.ipynb

- Repo with Pandas exercises https://github.com/guipsamora/pandas_exercises

- https://github.com/brandon-rhodes/pycon-pandas-tutorial

- https://github.com/jonathanrocher/pandas_tutorial

- https://github.com/chendaniely/scipy-2017-tutorial-pandas

- https://github.com/adeshpande3/Pandas-Tutorial



YouTube is your friend!! There are too many pandas tutorial videos to count.

<b>If you find a good resource, be sure to share it with the rest of the class.</b>