# NumPy + Pandas

### Learning Objectives

- Use NumPy in a Jupyter notebook to perform mathematical operations
- Import data from CSV files using Pandas
- Alter a Pandas dataframe by filtering or slicing to get a subset of data
- Use a combination of NumPy and Pandas to clean data

### Overview
- NumPy and Pandas are two common Python libraries used for statistical analysis, data wrangling, and advanced mathematical operations
- To put it simply, they are your connection to the data
- Pandas is the most important tool, as you'll spend the most time using it

## NumPy

In [None]:
# import the NumPy library (later cells refer to it as np)
import numpy as np


The most fundamental data object in NumPy is the multidimensional array. Think of an array as a table of elements, similar to an Excel spreadsheet, consisting of items all of the same type (usually numbers).
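As a rough illustration of the "all of the same type" point, the sketch below builds two small arrays and inspects their `dtype`. The variable names and values are made up for this example.

```python
import numpy as np

# A list of integers becomes an integer array.
ints = np.array([1, 3, 5])
is_integer = np.issubdtype(ints.dtype, np.integer)   # True

# Mixing integers and floats does not give a mixed array:
# NumPy converts every element to a single common type (float64 here).
mixed = np.array([1, 2.5, 3])
mixed_dtype = mixed.dtype
```

Because every element shares one type, NumPy can store the values compactly and apply an operation to all of them at once.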

In [None]:
# Let's turn a list into an array

data = [1, 3, 5, 7, 9, -1]
array = np.array(data)

In [None]:
# Call the array


In [None]:
# Print out the type of the array


How does an array differ from a list?

In [None]:
# Can you perform a mathematical operation with a list and an integer?

data + 5

In [None]:
# What about with an array and an integer?

array + 5

NumPy arrays support element-wise arithmetic and built-in statistical methods that plain lists lack, which makes them better suited to numerical work.
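To make the difference concrete, here is a minimal sketch using the same numbers as the `data` list above. The names `values`, `arr`, and `shifted` are chosen for this example only.

```python
import numpy as np

values = [1, 3, 5, 7, 9, -1]
arr = np.array(values)

# A list can only be concatenated with another list, so list + int raises TypeError.
try:
    values + 5
    list_add_failed = False
except TypeError:
    list_add_failed = True

# With an array, the integer is added to every element.
shifted = arr + 5

# The list equivalent needs an explicit loop or comprehension.
shifted_list = [v + 5 for v in values]
```

The array version is shorter, and for large inputs it is typically much faster, since the loop runs in compiled code rather than in Python.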

In [None]:
# Calculate the mean of the array


In [None]:
# Find the max value in the array


In [None]:
# Find the min value in the array


In [None]:
# Sum all of the values in the array


In [None]:
# Find the standard deviation of the values in the array


In [None]:
# Get the absolute value of the items in the array

In [None]:
# What happens when you do this?

dir(array)

You can also use NumPy itself to call certain functions.

In [None]:
# Calculate the median of the array

np.median(array)
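A few other module-level functions follow the same pattern of passing the array as an argument. This is only a sketch with one possible set of calls; `arr` and the result names are made up for the example.

```python
import numpy as np

arr = np.array([1, 3, 5, 7, 9, -1])

# Sorted values are -1, 1, 3, 5, 7, 9, so the median is (3 + 5) / 2.
median = np.median(arr)

squared = np.square(arr)      # same result as arr ** 2
magnitudes = np.abs(arr)      # absolute value of each element

# Taking the root of the magnitudes avoids the nan that np.sqrt
# would return for the negative entry.
roots = np.sqrt(magnitudes)
```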

In [None]:
# How would you square the values in the array?


In [None]:
# How would you get the square root of the values in the array?


So far, we've been using an array with a single dimension, similar to just using one column in Excel. However, arrays can also be multi-dimensional.
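One way to build such an array is sketched below. The 8 x 8 and 4 x 4 x 4 shapes are assumptions for illustration; any shape whose dimensions multiply to 64 would work.

```python
import numpy as np

# np.arange(64) produces 0, 1, ..., 63; reshape arranges them in 8 rows of 8.
grid = np.arange(64).reshape(8, 8)

n_dims = grid.ndim     # number of dimensions
shape = grid.shape     # (rows, columns)

# The same 64 values can also be arranged in three dimensions.
cube = np.arange(64).reshape(4, 4, 4)
```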

In [None]:
# Generate a multi-dimensional array with 64 items using np.arange and .reshape


Sometimes we want to extract a subset of data from an array. This process is called **slicing**.
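The sketch below shows the basic slicing patterns on an assumed 8 x 8 array. The particular row and column ranges are arbitrary choices for the example.

```python
import numpy as np

grid = np.arange(64).reshape(8, 8)

first_rows = grid[:2]        # rows 0 and 1, every column
first_cols = grid[:, :3]     # every row, columns 0 through 2
block = grid[2:4, 5:7]       # rows 2-3 and columns 5-6
single = grid[3, 4]          # one value: row 3, column 4
```

The general form is `array[rows, columns]`, where each part can be a single index or a `start:stop` range that excludes `stop`.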

In [None]:
# Get a slice of the rows


In [None]:
# Get a slice of the columns


In [None]:
# Get a slice of both the rows and the columns


In [None]:
# Get a specific value


Here's a very good tutorial on NumPy: https://www.datacamp.com/community/tutorials/python-numpy-tutorial

## Pandas

Pandas is a high-performance, open-source library for data analysis in Python, first developed by Wes McKinney in 2008. Over the years, it has become the de facto standard library for data analysis in Python, with wide adoption, a large community behind it, and a steady stream of new features and enhancements.

- It can process a variety of data sets in different formats: time series, tabular heterogeneous, and matrix data.
- It facilitates loading/importing data from varied sources such as CSV and DB/SQL.
- It can handle a myriad of operations on data sets: subsetting, slicing, filtering, merging, grouping, re-ordering, and re-shaping.
- It can deal with missing data according to rules defined by the user/developer: ignore, convert to 0, and so on.
- It can be used for parsing and munging (conversion) of data as well as modeling and statistical analysis.
- It integrates well with other Python libraries such as statsmodels, SciPy, and scikit-learn.
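As a small illustration of the missing-data rules mentioned in the list above, the sketch below applies "ignore" and "convert to 0" to a made-up Series. The `readings` values are invented for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical measurements with one missing value.
readings = pd.Series([1.5, np.nan, 3.0], index=['a', 'b', 'c'])

# "Ignore": aggregations skip NaN by default, so this is (1.5 + 3.0) / 2.
ignored_mean = readings.mean()

# "Convert to 0": replace every missing value with 0.
zero_filled = readings.fillna(0)

# Another option is to drop the missing entries altogether.
dropped = readings.dropna()
```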

In [None]:
# import the pandas library (later cells refer to it as pd)
import pandas as pd


In [None]:
# create a pandas series

series = pd.Series(
    [0.25, 0.5, 0.75, 1.0], 
    index=['a', 'b', 'c', 'd']
)

In [None]:
# return a row from the series


In [None]:
# change a value in the series


In [None]:
# turn a Python dictionary into a pandas series

population_dict = {
    'California': 38332521,
    'Texas': 26448193,
    'New York': 19651127,
    'Florida': 19552860,
    'Illinois': 12882135,
}

In [None]:
# get the population for California


In [None]:
# turn a Python dictionary into a pandas data frame

data = {
    'feature_one': [1, 2, 4, 8, -3],
    'feature_two': ['haight', 'mission', 'geary', 'castro', 'soma'],
    'feature_three': [True, True, False, True, False],
}

In [None]:
# return the columns of a dataframe


In [None]:
# return the index of a dataframe


In [None]:
# return the numpy array version of the data frame (also works on a series)


In [None]:
# get the type of the data frame


In [None]:
# get the feature_one column


In [None]:
# get the type of the feature_one column


In [None]:
# get multiple columns from the data frame


In [None]:
# add a new column to the data frame


The first dataset we'll work with is the drinks dataset.
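The contents of `drinks.csv` are not reproduced here, so the sketch below reads a tiny made-up CSV from an in-memory string instead of the real file. The column names are the ones referred to in this notebook; the country names and numbers are invented and are not real data.

```python
import io
import pandas as pd

# Made-up stand-in for the real file; the values are illustrative only.
csv_text = """country,beer_servings,wine_servings,continent
Country A,100,30,EU
Country B,0,5,AS
Country C,50,60,EU
"""

# pd.read_csv accepts a file path or a file-like object such as StringIO.
toy = pd.read_csv(io.StringIO(csv_text))

# Use the country names as row labels instead of 0, 1, 2.
toy = toy.set_index('country')

n_rows, n_cols = toy.shape
first_two = toy.head(2)
```

With the real dataset, passing the `path` string to `pd.read_csv` takes the place of the `StringIO` object.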

In [None]:
# here's the location of the drinks.csv file
path = '../../data/drinks.csv'

In [None]:
# read the drinks dataset into a pandas data frame


In [None]:
# take a look at some of the data


In [None]:
# let's make the country column the index using .set_index()


In [None]:
# let's take another look at our data -- this time, let's look at the last few rows


In [None]:
# How many rows and columns are there in the dataset?


In [None]:
# let's take a look at some of the details of this dataset using .info()


In [None]:
# we can also calculate some summary statistics


## Slicing Data Frames

In [None]:
# select the values in the Peru row using .loc[]


In [None]:
# select values in the wine_servings column


In [None]:
# slice the data frame by rows and columns


In [None]:
# use .iloc[] to get the row at index 48


In [None]:
# return the column at index 1


In [None]:
# return a slice of rows and columns


## Conditional Selection

In [None]:
# check if the row's continent is 'EU'


In [None]:
# check if the row's wine_servings > 20


Now we can take those commands and pass them into the drinks data frame to get a subset of the data.
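A minimal sketch of this pattern on a made-up data frame follows. The column names match the ones used in this notebook, while the rows and the names `toy`, `is_eu`, and `likes_wine` are invented for illustration.

```python
import pandas as pd

# Hypothetical data, not taken from the drinks dataset.
toy = pd.DataFrame(
    {
        'continent': ['EU', 'AS', 'EU', 'AF'],
        'beer_servings': [100, 0, 50, 20],
        'wine_servings': [30, 5, 10, 40],
    },
    index=['Country A', 'Country B', 'Country C', 'Country D'],
)

# Each comparison returns a Series of True/False values, one per row.
is_eu = toy['continent'] == 'EU'
likes_wine = toy['wine_servings'] > 20

europe = toy[is_eu]                   # rows where the mask is True
not_europe = toy[~is_eu]              # ~ inverts the mask
both = toy[is_eu & likes_wine]        # & means AND
either = toy[is_eu | likes_wine]      # | means OR
```

When the comparisons are written inline, each one needs its own parentheses, as in `toy[(toy['continent'] == 'EU') & (toy['wine_servings'] > 20)]`, because `&` and `|` bind more tightly than `==` and `>`.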

In [None]:
# get the data for countries in Europe


In [None]:
# get the data for non-European countries


In [None]:
# get the data for countries with more than 20 wine servings


In [None]:
# get the data for countries in Europe with more than 20 wine servings


In [None]:
# get the data for countries in Europe OR countries with more than 20 wine servings


In [None]:
# call index to return just the names of the countries with more wine servings than beer servings


Boolean values behave like numbers in arithmetic: True counts as 1 and False as 0. We can therefore sum a Boolean Series to count how many rows satisfy a condition.
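For instance, counting matches with `.sum()` might look like the sketch below. The `beer` values are made up for the example.

```python
import pandas as pd

# Hypothetical beer servings for four countries.
beer = pd.Series(
    [100, 0, 50, 0],
    index=['Country A', 'Country B', 'Country C', 'Country D'],
)

no_beer = beer == 0              # Boolean Series: False, True, False, True

count_no_beer = no_beer.sum()    # number of True values
share_no_beer = no_beer.mean()   # fraction of True values
```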

In [None]:
# How many countries consume no beer at all?


## Pandas Series

In [None]:
# assign the beer_servings column to the variable 'beer'


In [None]:
# we can do math operations, similar to numpy arrays


In [None]:
# What's the average number of beer servings consumed?


In [None]:
# What's the median number of beer servings consumed?


In [None]:
# What's the total worldwide beer consumption?


In [None]:
# we can add series together -- create a new column called 'total_servings'


In [None]:
# let's find out how many null values there are in continent


In [None]:
# replace every null value with "No Continent" using .fillna()


In [None]:
# What are the different continents?


In [None]:
# How many unique continents are there? (use .nunique())


In [None]:
# How many countries are from each continent? (use .value_counts())


In [None]:
# What percentage of the data belongs to each continent?


In [None]:
# get the top 5 booziest countries (using .sort_values())
