# Recap Last Week

We covered:
* Reading and writing CSV files with the built-in csv module
* Installing 3rd party packages with pip
* Reading and writing Excel files with the openpyxl module

# This Week

What we will cover:

* Pandas basic data structures: Series and DataFrames
* Basics of Data analysis with Pandas
* Data visualization with matplotlib

This week we'll be using pandas to analyze data, and matplotlib to visualize the data. If you don't have these libraries installed, open up a terminal window or command prompt and type:

    pip install pandas matplotlib

This will install all the libraries we'll need for this class.

We'll also be analyzing a UFO sighting dataset that you can find [here](https://www.kaggle.com/NUFORC/ufo-sightings). Once you make a free account you'll have access to that, and many more data sets.

### What is Pandas?

Pandas, short for **panel data**, is a 3rd party module in Python, which means you'll need to install it before you can use it.
    
Pandas is a library that provides high level data structures and manipulation tools to make data analysis fast, easy, and enjoyable in Python. Pandas is built on top of NumPy, an extremely efficient scientific computing library for Python. If you want to learn more about the engine that powers pandas, I highly encourage you to check out these links:

* [Numpy Basics](https://jakevdp.github.io/PythonDataScienceHandbook/02.02-the-basics-of-numpy-arrays.html)
* [Numpy Array Computation](https://jakevdp.github.io/PythonDataScienceHandbook/02.03-computation-on-arrays-ufuncs.html)
* [Numpy Aggregation Functions](https://jakevdp.github.io/PythonDataScienceHandbook/02.04-computation-on-arrays-aggregates.html)

### Getting Started with Pandas

In general when working with pandas, you'll rely on two main data structures: ``Series`` and ``DataFrame``

The typical import convention is: ``import pandas as pd``

In [None]:
import pandas as pd 

### Pandas Series

A Series is a one-dimensional, list-like object provided by pandas. The data is stored in a NumPy array, and has an associated array of labels, otherwise known as the index.

#### Creating a Series from a List

In [None]:
ser1 = pd.Series(['Monday', 'Tuesday', 'Wednesday','Thursday', 'Friday', 'Saturday', 'Sunday'])

print(type(ser1))

In [None]:
print(ser1)

Notice that when you print the string representation of a Series object, you'll see the data on the right-hand side and the labels on the left-hand side. Because we didn't supply an index, a numeric index was provided for us. This is the default behavior when working with both Series and DataFrames.

#### We can directly access the array data of our Series object using the .values attribute, and the labels of our Series using the .index attribute

In [None]:
ser1.values

In [None]:
ser1.index

#### Creating a Series with an index

It's usually useful to include an index for the data that you're trying to analyze. This makes it much easier to locate specific values within the Series

In [None]:
ser2 = pd.Series([1, 2, 3, -6, 7, 0, 5, -4], index=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print(ser2)

As you'll notice this time the index is set to exactly the values that we specified

#### Accessing values from the Series

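You can use either the position or the row label to access values stored in a Series. Use ``.iloc`` for positions: recent versions of pandas warn about, or no longer support, passing an integer position directly (``ser2[0]``) when the index is made of labels.

In [None]:
# access via the position
print(ser2.iloc[0])

# access via the index label
print(ser2['a'])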

#### Accessing Multiple values from the Series

If you need to access multiple values, you can pass a list of positions (with ``.iloc``) or a list of labels. This type of value access is made possible because of NumPy, and I highly encourage you to go back and check out the links I provided earlier if you want to learn more about it.

In [None]:
print(ser2.iloc[[0, 1, 2]])


print(ser2[['d', 'e', 'f']])

#### Reassigning Values in the Series

It's also fairly easy to reassign values of any Series. In order to not change the original Series, we'll make a copy of it using the ``.copy`` method first.

In [None]:
ser3 = ser2.copy()
print(ser3, end='\n\n')

ser3['d'] = 100
print(ser3, end='\n\n')

You should notice that reassigning values in a Series is syntactically identical to reassigning values in a Python dictionary. In a way, a Series is a supercharged dictionary because it allows you to access values via their row labels (keys), and via their position.
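
To make the dictionary comparison a little more concrete, here is a minimal sketch. The Series `ser` and the variable names are only for illustration; the point is that membership tests and ``.get`` work on the labels, the way they work on dictionary keys.

```python
import pandas as pd

# a small Series with string labels (illustrative data)
ser = pd.Series([1, 2, 3, -6], index=['a', 'b', 'c', 'd'])

# membership tests look at the labels (like dictionary keys), not the values
has_label = 'd' in ser
has_value = -6 in ser      # -6 is a value, not a label

# .get returns a default instead of raising a KeyError for a missing label
missing = ser.get('z', 0)

# unlike a plain dictionary, a Series also supports positional access
last_value = ser.iloc[-1]

print(has_label, has_value, missing, last_value)
```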

#### Creating a Series from a Dictionary

In fact, you can pass a dictionary as an argument when instantiating a Series, and the keys will automatically be assigned as the row labels. 

In [None]:
# noble gases from the periodic table: mapping names to atomic numbers
# data found at: https://en.wikipedia.org/wiki/Noble_gas
noble_gases  = {'Helium':2, 'Neon':10, 'Argon':18, 'Krypton':36, 'Xenon':54,'Radon':86}

ng_series = pd.Series(noble_gases)
# string representation of the ng_series
print(ng_series, end='\n\n')

# accessing the first value of the series by position
print(ng_series.iloc[0], end='\n\n')

# using dictionary lookup to find the value of 'Argon'
print(ng_series['Argon'], end='\n\n')

#### Filtering the Series

You can create Boolean indexes by using logical operations on a Series. A boolean index is just a Series where each data point is either True or False.

In [None]:
# create a boolean index for all values less than 36
ng_series < 36

You can then use these Boolean indexes to filter the data in the Series. Only values that evaluate to True in the Boolean index will remain when filtered

In [None]:
# filter the original Series where the atomic number is less than 36
ng_series[ng_series < 36]
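
Boolean indexes can also be combined. The sketch below rebuilds the noble gas Series so that it runs on its own; the result names (`middle`, `extremes`, `not_light`) are just illustrative.

```python
import pandas as pd

noble_gases = {'Helium': 2, 'Neon': 10, 'Argon': 18, 'Krypton': 36, 'Xenon': 54, 'Radon': 86}
ng_series = pd.Series(noble_gases)

# & means "and", | means "or", ~ means "not"
# each condition needs its own parentheses
middle = ng_series[(ng_series > 2) & (ng_series < 54)]
extremes = ng_series[(ng_series <= 2) | (ng_series >= 86)]
not_light = ng_series[~(ng_series < 36)]

print(middle, extremes, not_light, sep='\n\n')
```

Python's ``and`` / ``or`` keywords don't work here, because they try to reduce the whole Series to a single True or False.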

### Pandas DataFrame

A DataFrame is a tabular data structure provided by Pandas. You can think of a DataFrame as a collection of Series with the same index. A DataFrame is probably the data structure that you'll use the most in pandas, and it is indexed by both its row labels and its column headers
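
One way to see the "collection of Series" idea is to build a DataFrame out of Series that share an index. This is only a sketch; the names `labels`, `first`, and `second` are made up for the example.

```python
import pandas as pd

labels = ['x', 'y', 'z']
first = pd.Series([1, 2, 3], index=labels)
second = pd.Series([10.5, 20.5, 30.5], index=labels)

# each dictionary key becomes a column header, each Series becomes a column
df_from_series = pd.DataFrame({'first': first, 'second': second})

print(df_from_series)

# the DataFrame shares the index of the Series it was built from
print(df_from_series.index.equals(first.index))
```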

#### Creating a DataFrame from a list of lists

In [None]:
matrix = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9]
]

df = pd.DataFrame(matrix)

# What's the Type of a DataFrame?
print(type(df))

In [None]:
# The string representation of a DataFrame
print(df)

Similar to a Series, if you don't provide a specific index, a numeric index will be assigned automatically. In the previous example we didn't specify a column or row index, so pandas provided one for us.

You can inspect the index using ``.index`` attribute, and you can use the ``.columns`` attribute to list the columns

In [None]:
print(df.index)

print(df.columns)

#### Reassigning the index (Column and Row)
We can reassign the index if we need to by providing a list of values with a length equal to the length of the current index. Both the column headers and the row labels are 3 items long, so in order to replace them we can assign a list of 3 items.

In [None]:
df.index = ['a', 'b', 'c']
df.columns = ['col1', 'col2', 'col3']

print(df)
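
As a quick sketch of the length rule: assigning a list of the wrong length raises a ``ValueError`` and leaves the index unchanged. If you only want to change a few labels, the ``.rename`` method accepts a mapping instead. The variable names here are illustrative.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# two labels for three rows: pandas refuses the assignment
try:
    df.index = ['a', 'b']
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)

# .rename returns a new DataFrame and only touches the labels you list
renamed = df.rename(index={0: 'a'}, columns={2: 'last'})
print(renamed)
```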

#### Creating a DataFrame with indexes already assigned

It's often easier to assign column and row labels when instantiating the DataFrame object. You can do so by providing index and columns keyword arguments when instantiating the DataFrame

In [None]:
scores = [[90, 87, 85, 76], [93, 91, 91, 84], [70, 75, 80, 99], [83, 88, 91, 77]]
students = ['Bob', 'Alice', 'Carter', 'Dan']
subject = ['Math', 'Science', 'History', 'English']

test_scores = pd.DataFrame(scores, index=subject, columns=students)
print(test_scores)

#### Accessing DataFrame Columns

Accessing values from a Series was fairly straightforward because there was only a single axis where data was stored. Accessing data from a DataFrame is slightly different because you can access data via the row or the column.

Dictionary-style lookup only works for the columns of a DataFrame.

In [None]:
# Get the Series of Scores for Bob
bob_col = test_scores["Bob"]

print(bob_col, end='\n\n')

print(type(bob_col), end='\n\n')

**NOTE: As previously mentioned, each column is itself a Series, with the same index as the entire DataFrame**

We can pass a list of columns to filter the DataFrame for only the columns we need

In [None]:
a_and_b = test_scores[["Alice", "Bob"]]

print(a_and_b, end='\n\n')

print(type(a_and_b), end='\n\n')

**NOTE: Filtering the DataFrame for more than one column will return a DataFrame, NOT A SERIES** 

#### Accessing DataFrame Rows

Earlier, we were able to access the values of a Series by either their row label or their position. Because dictionary lookup on a DataFrame is reserved for the columns, passing a single row label or position in square brackets won't work. Slicing notation is the one exception: a slice inside square brackets selects rows. A DataFrame also has two attributes which support lookup on the DataFrame rows:

* loc - used for row lookup using the row label
* iloc - used for row lookup using the integer position of the row

**Note: trying to access a single row with plain index notation (for example ``test_scores[0]``) will throw an error! This is because dictionary lookup on the columns is assumed, and you'll receive a KeyError**

In [None]:
# this won't work: it raises a KeyError
# test_scores[0]

In [None]:
# using slicing notation
print(test_scores[:3])
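
If you'd like to see the difference without crashing a cell, the error can be caught. This sketch rebuilds `test_scores` so that it runs on its own; `raised` and `first_three` are illustrative names.

```python
import pandas as pd

scores = [[90, 87, 85, 76], [93, 91, 91, 84], [70, 75, 80, 99], [83, 88, 91, 77]]
students = ['Bob', 'Alice', 'Carter', 'Dan']
subject = ['Math', 'Science', 'History', 'English']
test_scores = pd.DataFrame(scores, index=subject, columns=students)

# a single value in square brackets is treated as a column label
try:
    test_scores[0]
    raised = False
except KeyError:
    raised = True

print(raised)

# a slice in square brackets is treated as a row selection
first_three = test_scores[:3]
print(first_three)
```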

#### Using loc

In [None]:
math_row = test_scores.loc["Math"]
print(math_row)
print(type(math_row))

**Note: Maybe this is intuitive based on the fact that the column headers are also treated like an index, but taking a row of the DataFrame, provides a Series indexed by the column header of the original DataFrame**

In [None]:
# You can use slicing syntax with the row labels (the end label is included)
test_scores.loc["Math":"History"]

In [None]:
# you can pass a list of row labels
test_scores.loc[["Science", "English"]]

In [None]:
test_scores.loc["Math", "Bob"]

#### Using iloc

Again, iloc is used to reference the rows of the DataFrame by their integer position

In [None]:
# accessing a single row
test_scores.iloc[0]

**Note: Using iloc allows you to access a single row from the DataFrame, whereas trying to use plain index notation will raise a KeyError**
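
One detail that is easy to trip over: label slices with ``loc`` include the end label, while position slices with ``iloc`` exclude the end position, just like regular Python slicing. A minimal sketch, rebuilding `test_scores` so it runs on its own:

```python
import pandas as pd

scores = [[90, 87, 85, 76], [93, 91, 91, 84], [70, 75, 80, 99], [83, 88, 91, 77]]
students = ['Bob', 'Alice', 'Carter', 'Dan']
subject = ['Math', 'Science', 'History', 'English']
test_scores = pd.DataFrame(scores, index=subject, columns=students)

# label slice: Math, Science AND History
by_label = test_scores.loc['Math':'History']

# position slice: positions 0 and 1 only
by_position = test_scores.iloc[0:2]

# iloc also accepts a (row position, column position) pair
cell = test_scores.iloc[0, 0]

print(len(by_label), len(by_position), cell)
```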

#### Adding New Columns to a DataFrame

We can assign new columns to a DataFrame by adding a new Series with identical row labels

In [None]:
test_scores['Ed'] = pd.Series([72, 80, 79, 77], index=subject)

print(test_scores)

As you can see, a column for Ed has now been added to the DataFrame
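
The new column is matched to the DataFrame by row label, not by order. The sketch below shows this with Ed's scores supplied in a different order, and with a hypothetical student, Fay (not part of the original data), who only has two scores.

```python
import pandas as pd

scores = [[90, 87, 85, 76], [93, 91, 91, 84], [70, 75, 80, 99], [83, 88, 91, 77]]
students = ['Bob', 'Alice', 'Carter', 'Dan']
subject = ['Math', 'Science', 'History', 'English']
test_scores = pd.DataFrame(scores, index=subject, columns=students)

# same scores for Ed, but listed in reverse subject order
test_scores['Ed'] = pd.Series([77, 79, 80, 72],
                              index=['English', 'History', 'Science', 'Math'])

# Fay is a made-up student with scores for only two subjects;
# the subjects without a score are filled with NaN
test_scores['Fay'] = pd.Series([95, 88], index=['Math', 'Science'])

print(test_scores)
```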

#### Filtering the DataFrame With Boolean Indexing

As we did before, we can use Boolean indexing to filter the DataFrame. If we use a conditional expression on a DataFrame, each cell will be evaluated against the expression

In [None]:
test_scores > 80

When applying the boolean DataFrame to the original DataFrame, only True values remain, and the rest are filled with NaN.

In [None]:
test_scores[test_scores > 80]

It's often more meaningful to filter along a row or a column, for example, filtering for students with Math scores better than 85

In [None]:
# use .transpose to flip the column and row labels, because normally we can't use dictionary lookup on row labels
test_scores.transpose()[test_scores.transpose()["Math"] > 85]

Sure enough, the new DataFrame only contains values where Math was > 85

Look here for more info on [.transpose](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.transpose.html)
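
Transposing is not the only option. As a sketch of an alternative, you can build the Boolean index from the Math row with ``loc`` and use it to select columns. The names `math_above_85`, `filtered`, and `via_transpose` are illustrative.

```python
import pandas as pd

scores = [[90, 87, 85, 76], [93, 91, 91, 84], [70, 75, 80, 99], [83, 88, 91, 77]]
students = ['Bob', 'Alice', 'Carter', 'Dan']
subject = ['Math', 'Science', 'History', 'English']
test_scores = pd.DataFrame(scores, index=subject, columns=students)

# boolean Series indexed by student name
math_above_85 = test_scores.loc['Math'] > 85

# keep every row (:) but only the columns where the mask is True
filtered = test_scores.loc[:, math_above_85]
print(filtered)

# the transpose approach, for comparison (students end up as rows)
via_transpose = test_scores.transpose()[test_scores.transpose()['Math'] > 85]
print(via_transpose)
```

Both approaches keep the same students; the only difference is whether students end up as columns or as rows.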

#### Aggregate Functions

Aggregate functions let you quickly and easily gather information on the data that you're working with. The result of an aggregation on a DataFrame is a Series

In [None]:
# What was the average test score for each student?
test_scores.mean()

In [None]:
# What was the highest test score for each student?
test_scores.max()

In [None]:
# What was the lowest test score for each student?
test_scores.min()

By default these aggregations will operate on the columns. If you want to aggregate across the rows, you'll either need to transpose your data (flip the columns and rows), or you'll need to change the axis that the aggregation is applied to.

In [None]:
# What was the average test score in each subject?
test_scores.mean(axis=1)

In [None]:
# What was the highest test score in each subject?
test_scores.max(axis=1)

In [None]:
# What was the lowest test score in each subject?
test_scores.min(axis=1)

In [None]:
# what was the standard deviation of test scores in each subject?
test_scores.std(axis=1)
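
To check that the two options really agree, here is a small sketch comparing ``axis=1`` with aggregating the transposed DataFrame. It uses the original four students only, and the variable names are illustrative.

```python
import pandas as pd

scores = [[90, 87, 85, 76], [93, 91, 91, 84], [70, 75, 80, 99], [83, 88, 91, 77]]
students = ['Bob', 'Alice', 'Carter', 'Dan']
subject = ['Math', 'Science', 'History', 'English']
test_scores = pd.DataFrame(scores, index=subject, columns=students)

# option 1: change the axis the aggregation is applied to
per_subject_axis = test_scores.mean(axis=1)

# axis='columns' is a more readable spelling of axis=1
per_subject_named = test_scores.mean(axis='columns')

# option 2: flip the rows and columns, then aggregate as usual
per_subject_transposed = test_scores.transpose().mean()

print(per_subject_axis)
```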

### A Bit about efficiency

As you've become more familiar with Python, you've gotten comfortable with the **for loop**. When working with pandas, it's generally not a good idea to use for loops, because they are **EXTREMELY** inefficient.

In [None]:
import numpy as np
random_df = pd.DataFrame(np.random.randn(100, 100))

print(random_df.head())

def square(value):
    return value ** 2

##### Using a normal for Loop

In [None]:
%%timeit

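# iterating over a DataFrame directly yields its column labels,
# so loop over the row positions instead
for i in range(len(random_df)):
    square(random_df.iloc[i])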


##### Using iterrows (if you absolutely have to loop)

In [None]:
%%timeit

for index, row in random_df.iterrows():
    square(row)

##### Use .apply, it's more efficient!

In [None]:
%%timeit

random_df.apply(square)

In [None]:
%%timeit
random_df.apply(np.square)


In [None]:
%%timeit

random_df ** 2

In [None]:
%%timeit

np.square(random_df)

Pandas code generally runs more efficiently when it's vectorized. As you can see from above, the two solutions that were the most efficient were the ones that used the optimized NumPy functions. Also note that using .iterrows and .apply were faster than using the for loop. These were all numeric calculations so they ran pretty quickly anyway, but you could foreseeably have other data types, and other operations that you'd like to apply to an entire DataFrame or Series, and you'll almost always see better performance if you:

    1) use .apply or .iterrows instead of a plain for loop
    2) use optimized NumPy functions
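
All of these approaches compute the same thing; only the speed differs. The sketch below checks this on a small DataFrame (the 5 x 3 size and the seed are arbitrary choices for the example) and leaves out the timings, which depend on your machine.

```python
import numpy as np
import pandas as pd

# small random DataFrame so the results are easy to inspect
rng = np.random.default_rng(0)
small_df = pd.DataFrame(rng.standard_normal((5, 3)))

def square(value):
    return value ** 2

# row-by-row loop, collecting each squared row into a new DataFrame
looped = pd.DataFrame([square(row) for _, row in small_df.iterrows()])

# column-by-column with .apply
applied = small_df.apply(square)

# fully vectorized versions
vectorized = small_df ** 2
with_numpy = np.square(small_df)

print(np.allclose(looped, vectorized))
print(np.allclose(applied, vectorized))
print(np.allclose(with_numpy, vectorized))
```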

### Data Analysis

We'll take what we've learned about how to work with pandas DataFrames and Series, and apply that to analyzing a [ufo sighting](https://www.kaggle.com/NUFORC/ufo-sightings) dataset. If you haven't already done so, please download and unzip the data

### Reading in Data

In [None]:
from pathlib import Path

# You technically don't need a Path object here, but I like using it because it allows me to create file system paths
# that work regardless of the operating system
ufo_csv_path = Path('ufo-sightings/scrubbed.csv')
ufo_csv_path.absolute()

ufo_df = pd.read_csv(ufo_csv_path)
# Pandas has the ability to quickly read data in from many file formats, for example:
# Excel, Json, HTML, SQL, etc.

#### Inspecting the Data

When doing any exploratory data analysis it's usually a good idea to check out our data and see what we're working with. You can use the ``.head`` method of a pandas Series or DataFrame to inspect the first few rows. You can also use the ``.tail`` method to inspect the last few rows

In [None]:
ufo_df.head()

In [None]:
ufo_df.tail()

DataFrames also have a useful ``.describe`` method, which makes it really easy to get some quick statistics on the data

In [None]:
ufo_df.describe()

**Note: the .describe method will only evaluate numeric data types.**

The main take away from looking at the result of ``.describe`` is realizing that there are **over 80,000 rows** in our dataset and pandas is able to handle that easily!!!

Because longitude was the only column that got evaluated, I'm thinking it's possible that not all the data is of the appropriate datatype. This might be because we read the data in from a csv file. We can quickly check the datatype of each column by using the ``.dtypes`` attribute of a DataFrame.

#### Casting Data

In [None]:
ufo_df.dtypes

When you see **object**, that's pandas' way of saying that the values are strings. You'll also note that longitude is a **float64**. The important thing to take away here is that pandas (but really NumPy) is built on top of C, not Python, which means it stores values as C data types. Being built on C is one of the reasons pandas is quite fast when written well. Although not all of our data is numeric, let's work on casting as much as we can into an appropriate datatype. This means:

* Casting the 'datetime' column to a datetime type
* Casting the 'duration (seconds)' column to a number
* Casting the 'date posted' column to a date
* Casting the 'latitude' column to a float

In [None]:
# There are tons of ways to cast data. In fact, you can also cast data types when reading in the data
# here pd.to_numeric is applied to each selected column; values that can't be parsed become NaN
ufo_df[['duration (seconds)', 'latitude']] = ufo_df[['duration (seconds)', 'latitude']].apply(pd.to_numeric, errors='coerce')

# the dates in this file appear to be written month first (for example 4/27/2004), hence %m/%d/%Y
# adjust the format strings if your copy of the data differs
ufo_df['datetime'] = pd.to_datetime(ufo_df['datetime'], format='%m/%d/%Y %H:%M', errors='coerce')

ufo_df['date posted'] = pd.to_datetime(ufo_df['date posted'], format='%m/%d/%Y', errors='coerce')

# check out the data types again
print(ufo_df.dtypes)

For more info on **pd.to_numeric** look [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_numeric.html)

For more info on **pd.to_datetime** look [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html)
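
To see what ``errors='coerce'`` does, here is a sketch on a couple of tiny, made-up Series (these values are not taken from the UFO dataset). Anything that can't be parsed is replaced with a missing value instead of raising an error.

```python
import pandas as pd

# made-up values: one of them is not a number
raw = pd.Series(['120', '45.5', 'unknown', '7200'])
as_numbers = pd.to_numeric(raw, errors='coerce')
print(as_numbers)

# made-up values: one of them is not a date
raw_dates = pd.Series(['4/27/2004', 'not a date'])
as_dates = pd.to_datetime(raw_dates, format='%m/%d/%Y', errors='coerce')
print(as_dates)
```

Unparseable numbers become ``NaN`` and unparseable dates become ``NaT``, which is why it's worth counting missing values after casting this way.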

#### Dropping NaN

Just from inspecting the head and tail of the data we see that there are some NaN values. Normally we'd want to go in and preserve as much data as we could, but for this analysis I'll choose to remove any row containing a NaN value

In [None]:
# This will remove all rows with NaN in place. 
ufo_df.dropna(inplace=True)
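
If you'd rather preserve more of the data, ``dropna`` accepts a ``subset`` argument so that only the columns you care about are checked. The sketch below uses a small made-up DataFrame (`sample`), not the UFO data.

```python
import numpy as np
import pandas as pd

# made-up data with one missing value in each column
sample = pd.DataFrame({
    'city': ['austin', 'reno', None, 'boise'],
    'state': ['tx', 'nv', 'ca', None],
    'duration (seconds)': [60.0, np.nan, 30.0, 15.0],
})

# how many values are missing in each column?
missing_per_column = sample.isna().sum()
print(missing_per_column)

# drop a row if ANY column is missing
drop_any = sample.dropna()

# drop a row only if the duration is missing
drop_subset = sample.dropna(subset=['duration (seconds)'])

print(len(sample), len(drop_any), len(drop_subset))
```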

Let's get a better sense of which countries these sightings occurred in. To get a list of all the unique countries where UFOs were sighted, we can use the ``.unique`` method on a pandas Series

In [None]:
ufo_df['country'].unique()

So we can see that 4 countries contributed to this dataset. Maybe we'd like to know how many sightings took place in each of these countries. A quick way to do this is to use the ``.value_counts`` method of a pandas Series

In [None]:
country_counts = ufo_df['country'].value_counts()
print(country_counts)
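
``value_counts`` can also report shares instead of raw counts by passing ``normalize=True``. A minimal sketch on a made-up Series of country codes (not the real dataset):

```python
import pandas as pd

# made-up country codes
countries = pd.Series(['us', 'us', 'ca', 'us', 'gb', 'ca'])

# raw counts, sorted from most to least common
counts = countries.value_counts()
print(counts)

# fraction of rows for each country
shares = countries.value_counts(normalize=True)
print(shares)
```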

With very little effort at all we can see that a majority of the data came from the us

#### Easy plots (must have matplotlib installed)

Both Series and DataFrames can be easily graphed using the ``.plot`` method

In [None]:
country_counts.plot(kind='bar')

#### Filtering the Data

Because the US makes up an overwhelming share of the dataset, let's focus our exploration just on the US sightings

In [None]:
us_ufos = ufo_df[ufo_df['country'] == 'us']

Now, let's see which states had the most activity in our dataset. Instead of directly accessing the "state" column as we did with "country", we'll use a groupby operation to achieve an equivalent result

In [None]:
state_counts = us_ufos.groupby('state').count()

In [None]:
state_counts.head()

Taking a look at the head of our data, we can see that each column was grouped by state, and the count was taken. You'll also notice that **state** has become the new index of the resulting grouped DataFrame. You should also note that each column now has identical values, which means we can really use any of these columns (or Series), when trying to plot out ufo sightings by state

In [None]:
state_counts.sort_values('city', ascending=True)['city'].plot(kind='barh', figsize=(12, 12), title='UFO Sightings By US State')
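
A note on why the counts matched across columns: ``.count()`` counts the non-missing values in each column, so the columns only agree when there are no missing values left. If all you want is the number of rows per group, ``.size()`` or ``.value_counts()`` says so more directly. A sketch on a small made-up DataFrame:

```python
import pandas as pd

# made-up sightings, with one missing city
sightings = pd.DataFrame({
    'state': ['tx', 'ca', 'tx', 'wa', 'ca', 'tx'],
    'city': ['austin', 'fresno', 'dallas', 'tacoma', None, 'waco'],
})

# non-missing values per column, for each state
per_column = sightings.groupby('state').count()
print(per_column)

# number of rows per state, regardless of missing values
per_group = sightings.groupby('state').size()
print(per_group)

# the same counts, straight from the column
via_value_counts = sightings['state'].value_counts()
print(via_value_counts)
```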

#### On average, how long are UFO sightings?

In [None]:
avg_sighting_time = us_ufos['duration (seconds)'].mean() // 60

print(f'UFO Sightings are about {avg_sighting_time} minutes long')

#### How long are UFO sightings by State?

In [None]:
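# select the numeric column before aggregating: recent pandas versions raise an error
# when .mean() is asked to average text columns
(us_ufos.groupby('state')['duration (seconds)'].mean() // 60).sort_values().plot(kind='barh', figsize=(12, 12), title='UFO Sighting Duration by State')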

#### When did UFO Sightings occur?

Because we have date and time data spanning several years, we could extract a lot of additional insights into when these sightings occurred. For now we'll keep our analysis to sightings per year, and sightings per time of day

In [None]:
time_df = us_ufos[['datetime', 'date posted']].copy()
time_df['year'] = time_df['date posted'].apply(lambda x: x.year)

In [None]:
time_df['hour'] = time_df['datetime'].apply(lambda x: x.hour)


In [None]:
time_df.head()
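
In keeping with the efficiency section, datetime columns also offer a vectorized alternative to ``.apply`` with a lambda: the ``.dt`` accessor. A sketch on two made-up timestamps (`stamps` is an illustrative name):

```python
import pandas as pd

# made-up timestamps
stamps = pd.Series(pd.to_datetime(['2004-04-27 20:30', '2010-12-01 03:15']))

# element by element, as in the cells above
years_apply = stamps.apply(lambda x: x.year)

# vectorized, using the .dt accessor
years_dt = stamps.dt.year
hours_dt = stamps.dt.hour

print(years_dt.tolist(), hours_dt.tolist())
```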

In [None]:
time_df.groupby('year').count().sort_values('year')['hour'].plot()

In [None]:
time_df.groupby('hour').count()['year'].plot(kind='pie', subplots=True, figsize=(12, 12))

### Additional Resources

* [10 Minutes to Pandas](https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html) -- Pandas Documentation getting started page
* [Pandas Beginner Tutorial](https://www.learndatasci.com/tutorials/python-pandas-tutorial-complete-introduction-for-beginners/) on learndatasci.com
* [Quentin Caudron - Introduction to data analytics with pandas](https://www.youtube.com/watch?v=F7sCL61Zqss&t=2524s), PyData 2017 Conference Workshop. In the 2 hour workshop, Quentin walks you through reading, cleaning, transforming and visualizing your data with pandas
* [Intro to Pandas Tutorial Series](https://www.youtube.com/playlist?list=PLQVvvaa0QuDfSfqQuee6K8opKtZsh7sA9) by sentdex on YouTube
* [Speed Up Your Pandas Project](https://realpython.com/fast-flexible-pandas/) -- Blog post about writing more efficient Pandas Code
* [Python for Data Analysis Book](http://shop.oreilly.com/product/0636920023784.do), by Wes McKinney, the creator of Pandas
* [High Performance Data Processing in Python](https://www.youtube.com/watch?v=NoJr08FNQeg), PyData 2018 talk by Donald Whyte