<a href="https://colab.research.google.com/github/stephenfrein/py_packages_data_analysis/blob/master/python_packages_data_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Useful Python Packages for Data Analysis

This Jupyter notebook runs in Google's Colab environment. We will use it to learn about some Python packages that are useful for data analysis.

You can write and execute your Python code right in the browser here. No additional setup is required.

The main packages we will cover today are *pandas* (used for manipulating tabular data) and *matplotlib* (used to create graphs).

# Pandas

The pandas library is essential for data analysis in Python. It allows you to manipulate tabular data structures, such as you would find in a relational database or spreadsheet.

Some things we'll do with pandas:
*   Load data
*   Describe that data
*   Reshape the data




# Load Data

Let's get some data first. We can load data from a variety of formats, including:
*   text files (CSV, fixed-width)
*   JSON
*   HTML
*   MS Excel
*   SQL

... and many others as well.

We can also pull data from a filesystem (just give the path) or a URL.
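
As a small sketch of how the reader functions follow the same pattern across formats, the example below reads the same made-up records from CSV text and from JSON text. It uses in-memory strings (via `io.StringIO`) in place of a real file path or URL so that it runs without any external files; the sample data is an assumption for illustration only.

```python
import io

import pandas as pd

# the same two made-up records, once as CSV text and once as JSON text
csv_text = "name,age\nAda,36\nGrace,45\n"
json_text = '[{"name": "Ada", "age": 36}, {"name": "Grace", "age": 45}]'

# each format has its own read_* function; a file path or URL
# could be passed in place of the StringIO objects used here
csv_df = pd.read_csv(io.StringIO(csv_text))
json_df = pd.read_json(io.StringIO(json_text))

print(csv_df)
print(json_df)
```

Both calls return a DataFrame with the same columns, so the rest of an analysis does not need to care which format the data came from.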

To get things started, we'll load some CSV data about the Titanic from a URL.

In [0]:
# it's conventional to alias pandas as pd once imported
import pandas as pd

url = "https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv"
# pandas will read this data into a DataFrame, the typical pandas data structure
# df is a common abbreviation in DataFrame variable names
titanic_df = pd.read_csv(url)

# Explore Our Data

Now that we've loaded our data, let's take a look at it. 
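
A quick first check on any DataFrame is its size and column types. The sketch below uses a small made-up DataFrame (not the Titanic data) so that it runs on its own; the same attributes should work on any DataFrame.

```python
import pandas as pd

# made-up data for illustration only
people_df = pd.DataFrame({
    "name": ["Ada", "Grace", "Alan"],
    "age": [36, 45, 41],
    "fare": [7.25, 71.28, 8.05],
})

# shape is a (rows, columns) tuple
n_rows, n_cols = people_df.shape

# columns holds the column labels
column_names = list(people_df.columns)

# dtypes shows the data type pandas inferred for each column
column_types = people_df.dtypes

print(n_rows, n_cols)
print(column_names)
print(column_types)
```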

In [0]:
# let's see the first 10 rows
titanic_df.head(n=10)
# tail() works the same way but shows the last rows

Unnamed: 0,Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
0,0,3,Mr. Owen Harris Braund,male,22.0,1,0,7.25
1,1,1,Mrs. John Bradley (Florence Briggs Thayer) Cum...,female,38.0,1,0,71.2833
2,1,3,Miss. Laina Heikkinen,female,26.0,0,0,7.925
3,1,1,Mrs. Jacques Heath (Lily May Peel) Futrelle,female,35.0,1,0,53.1
4,0,3,Mr. William Henry Allen,male,35.0,0,0,8.05
5,0,3,Mr. James Moran,male,27.0,0,0,8.4583
6,0,1,Mr. Timothy J McCarthy,male,54.0,0,0,51.8625
7,0,3,Master. Gosta Leonard Palsson,male,2.0,3,1,21.075
8,1,3,Mrs. Oscar W (Elisabeth Vilhelmina Berg) Johnson,female,27.0,0,2,11.1333
9,1,2,Mrs. Nicholas (Adele Achem) Nasser,female,14.0,1,0,30.0708


In [0]:
# let's see some stats about the overall data set
titanic_df.describe()

Unnamed: 0,Survived,Pclass,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
count,887.0,887.0,887.0,887.0,887.0,887.0
mean,0.385569,2.305524,29.471443,0.525366,0.383315,32.30542
std,0.487004,0.836662,14.121908,1.104669,0.807466,49.78204
min,0.0,1.0,0.42,0.0,0.0,0.0
25%,0.0,2.0,20.25,0.0,0.0,7.925
50%,0.0,3.0,28.0,0.0,0.0,14.4542
75%,1.0,3.0,38.0,1.0,0.0,31.1375
max,1.0,3.0,80.0,8.0,6.0,512.3292


In [0]:
# wait a minute - that only gave us the numerics
titanic_df.describe(include="all")

Unnamed: 0,Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
count,887.0,887.0,887,887,887.0,887.0,887.0,887.0
unique,,,887,2,,,,
top,,,Dr. Alfred Pain,male,,,,
freq,,,1,573,,,,
mean,0.385569,2.305524,,,29.471443,0.525366,0.383315,32.30542
std,0.487004,0.836662,,,14.121908,1.104669,0.807466,49.78204
min,0.0,1.0,,,0.42,0.0,0.0,0.0
25%,0.0,2.0,,,20.25,0.0,0.0,7.925
50%,0.0,3.0,,,28.0,0.0,0.0,14.4542
75%,1.0,3.0,,,38.0,1.0,0.0,31.1375
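
The difference between the two `describe()` calls is easier to see on a tiny table. This is a minimal sketch with made-up values: by default only numeric columns are summarized, while `include="all"` adds text columns along with the extra `unique`, `top`, and `freq` rows.

```python
import pandas as pd

# made-up data with one text column and one numeric column
toy_df = pd.DataFrame({
    "sex": ["male", "female", "male"],
    "age": [22.0, 38.0, 26.0],
})

# default: numeric columns only
numeric_summary = toy_df.describe()

# include="all": every column, with NaN where a statistic does not apply
full_summary = toy_df.describe(include="all")

print(numeric_summary)
print(full_summary)
```

In the full summary, statistics such as `mean` are NaN for the text column, and `unique`, `top`, and `freq` are NaN for the numeric column, which is why the output above has so many empty cells.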


# Exercise

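Now you try. The cell below loads a data set of COVID-19 cases and deaths by date, county, and state from a URL. Explore it with `head()` and `describe()`, then pick a grouping column (such as `state`) and find out which group has the highest average of a numeric column (such as `cases`).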
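
Questions of the form "which group has the highest average value" can be answered by grouping, averaging, and then picking the largest result. The sketch below shows one way to do that on a small made-up DataFrame; the `region` and `amount` columns are hypothetical and are not from the exercise data.

```python
import pandas as pd

# made-up data for illustration only
sales_df = pd.DataFrame({
    "region": ["north", "south", "north", "south", "east"],
    "amount": [10, 40, 30, 20, 25],
})

# group rows by region, then average the amount within each group
avg_by_region = sales_df.groupby("region")["amount"].mean()

# idxmax returns the index label (here, the region) with the largest value
top_region = avg_by_region.idxmax()

print(avg_by_region)
print(top_region)
```

The same three steps (group, aggregate, pick the maximum) should carry over to the exercise once you decide which columns to group by and to average.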

In [0]:
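# write your code below
url = "https://drive.google.com/uc?export=download&id=1mKW-hqj7NVd-ong7sae-AMOCW75oTHZD"
covid_df = pd.read_csv(url)
# only the last expression in a cell is displayed automatically,
# so wrap head() in print() if you want to see it as well
covid_df.head()
covid_df.describe(include="all")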

Unnamed: 0,date,county,state,fips,cases,deaths
count,70218,70218,70218,69315.0,70218.0,70218.0
unique,89,1647,55,,,
top,2020-04-18,Washington,Texas,,,
freq,2762,862,4262,,,
mean,,,,29664.325586,137.243228,4.953758
std,,,,15509.836457,1703.652015,98.072606
min,,,,1001.0,0.0,0.0
25%,,,,17197.0,2.0,0.0
50%,,,,28151.0,7.0,0.0
75%,,,,44005.0,28.0,1.0
