# Using the Pandas Package for Data Analysis, pt.1

This notebook walks us through a quick tutorial on using the pandas package for data analysis with Python.

### Overview of Tutorial

Over the next 2 class sessions, we will use this tutorial to cover the following processes:

*Day 1*
1. importing the pandas package 
2. creating a dataframe
3. exploring our dataframe's attributes

*Day 2*
4. using functions to filter our data
5. using functions to merge and join our data
6. creating a subset and exporting as a new .csv file

### Acknowledgements

This Pandas tutorial has been adapted from materials provided by the excellent staff at the Davis Library Research Hub.

For more detailed examples and exercises, see their [Python: Intro to Data lessons](https://unc-libraries-data.github.io/Python/Intro/Introduction_CrashCourse.html).

### Importing Pandas

#### Packages
Packages provide additional tools and functions not present in base Python. Python includes a number of packages to start with, and the Anaconda distribution, which we've all downloaded for Unit 3, comes with the pandas package already installed.

Once you've installed a package, you can load it into your current Python session with the `import` statement. Until a package is imported, its functions are not available.


#### Pandas

Like spreadsheets in Microsoft Excel, Pandas allows us to store our data in tabular, multi-dimensional objects (dataframes) with familiar features like rows, columns, and headers. This is useful because it makes management, manipulation, and cleaning of large datasets much easier than would be the case using Python's built-in data structures such as lists. Pandas also provides a wide range of useful tools for working with data once it has been stored and structured.

Begin by importing the pandas package using the following command:


In [6]:
import numpy as np
import pandas as pd

Notice that we load pandas with the usual `import pandas` and an extra `as pd` statement. This allows us to call functions from `pandas` with `pd.<function>` instead of `pandas.<function>` for convenience. `as pd` is **not** necessary to load the package.
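
To make the role of `as pd` concrete, here is a minimal sketch showing that `pd` and `pandas` are two names for the same package. The variable names `same_module` and `same_function` are made up for this illustration.

```python
import pandas
import pandas as pd

# `pd` is only a shorter alias: both names point to the same module object.
same_module = pd is pandas
print(same_module)

# Because it is the same module, the same function is reachable through either name.
same_function = pd.read_csv is pandas.read_csv
print(same_function)
```

Both lines print `True`, which is why the alias is a convenience rather than a requirement.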

Note, we also imported the `numpy` package, which is going to help pandas do some of its math.

### Creating a DataFrame

#### Working Directories & Relative Paths

By now, you should have either downloaded the CSV file "CountyHealthData_2014-2015.csv" from Canvas, or saved your own data as a CSV file. I've stored my copy in the same folder as this Jupyter Notebook. **NOTE:** make sure that your CSV file is saved in the same working directory as the .ipynb notebook file that you will use.

Remember that Jupyter Notebooks automatically set your working directory to the folder where the .ipynb file is saved. You'll have to save the document at least once to set your directory, but once it is set you can use what are called relative file paths to access the files in that folder.

If a file is located in your working directory, its relative path is just the name of the file!
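
If you are unsure where your working directory is, the following sketch shows one way to check it with Python's built-in `pathlib` module. The file name `my_data.csv` is a hypothetical placeholder, not a file from this class.

```python
from pathlib import Path

# The working directory: relative paths are resolved against this folder.
working_dir = Path.cwd()
print(working_dir)

# For a file in the working directory, the relative path is just the file name.
# "my_data.csv" is a hypothetical name used only for illustration.
relative_path = Path("my_data.csv")

# Joining the two shows the full location that the relative path refers to.
full_path = working_dir / relative_path
print(full_path)

# exists() tells us whether the file is really there before we try to read it.
print(full_path.exists())
```

If the last line prints `False` for your own file name, the file is probably saved in a different folder than the notebook.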

#### Using the `pd.read_csv()` function

`pd.read_csv()` reads the tabular data from a Comma Separated Values (CSV) file into a dataframe object.

To create our dataframe, we call `pd.read_csv()` with the relative file path of our data file inside the parentheses, and assign the result to a variable that we'll name `df`.
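
As a self-contained sketch of what `pd.read_csv()` does, the example below reads a tiny made-up CSV from memory instead of from a file. The names `csv_text` and `example_df` and the values in the table are invented for illustration.

```python
import io

import pandas as pd

# A tiny, made-up CSV held in memory so the example runs without a file on disk.
csv_text = "Date,Open,Close\n2020-01-02,10.0,10.5\n2020-01-03,10.5,10.2\n"

# pd.read_csv accepts a file path or a file-like object.
# By default the first line of the CSV becomes the column headers.
example_df = pd.read_csv(io.StringIO(csv_text))
print(example_df)
```

With a real file, the only change is passing the relative path as a string in place of `io.StringIO(csv_text)`.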

In [1]:
df=pd.read_csv("Apple.csv")

NameError: name 'pd' is not defined

### Exploring Our Dataframes

#### Attributes

A good first step in exploring our dataframe is to examine some of its basic attributes. Attributes contain **values** that provide helpful information about the dataframe and guide how we interact with it. In pandas, we access attributes with the following syntax:

`<DataFrame name>.<attribute name>`

We can use the `.shape` attribute to determine how many rows and columns (in that order) are available. The `.size` attribute gives us the number of cells in the dataframe (rows * columns).

In [20]:
df.shape

(5852, 7)

In [12]:
df.size

40964

In [13]:
df.size == 6109 * 64

False
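
The relationship between `.shape` and `.size` can be checked on a small dataframe built by hand. This is a sketch with made-up data; `small_df`, `n_rows`, and `n_cols` are names chosen for the example.

```python
import pandas as pd

# A small made-up dataframe: 3 rows and 2 columns.
small_df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

# .shape is a tuple of (rows, columns), so it can be unpacked into two variables.
n_rows, n_cols = small_df.shape
print(small_df.shape)  # (3, 2)

# .size is the total number of cells, which equals rows * columns.
print(small_df.size)  # 6
print(small_df.size == n_rows * n_cols)  # True
```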

Other useful attributes include:

- `.columns` provides the column names for the Dataframe
- `.dtypes` provides the pandas datatype for each column


In [21]:
df.columns

Index(['Date', 'Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume'], dtype='object')

In [22]:
df.dtypes

Date          object
Open         float64
High         float64
Low          float64
Close        float64
Adj Close    float64
Volume         int64
dtype: object
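
Both attributes return objects that we can keep working with. The sketch below, using a made-up dataframe called `demo_df`, turns `.columns` into a plain Python list and looks up the data type of a single column.

```python
import pandas as pd

# Made-up data with one decimal column and one whole-number column.
demo_df = pd.DataFrame({"Open": [2.4, 1.9], "Volume": [72156000, 14700000]})

# .columns returns an Index object; list() converts it to a regular Python list.
column_names = list(demo_df.columns)
print(column_names)  # ['Open', 'Volume']

# .dtypes returns one entry per column, which we can look up by column name.
print(demo_df.dtypes["Open"])    # float64
print(demo_df.dtypes["Volume"])  # int64
```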

We'll also use attributes (`.loc` and `.iloc`) to interact with our dataframes on Friday.

#### Methods

Much of the functionality for working with dataframes comes in the form of methods. Methods are specialized functions that only work for a certain type of object, with the syntax:

`<object name>.<method>()`
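
The visible difference between an attribute and a method is the parentheses. Here is a minimal sketch, assuming a made-up dataframe named `demo_df`:

```python
import pandas as pd

demo_df = pd.DataFrame({"x": [1, 2, 3]})

# Attribute: no parentheses. It simply returns a stored value.
print(demo_df.shape)  # (3, 1)

# Method: parentheses are required to actually run it.
print(demo_df.head())

# Without parentheses we get the method itself rather than its result.
print(callable(demo_df.head))  # True
```

Forgetting the parentheses on a method is a common mistake: Python will not raise an error, but the output will be a description of the method instead of the data.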

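We can look at the first or last rows in the dataset directly with the `.head()` and `.tail()` methods. By default each returns 5 rows; we can pass a number (the parameter `n`) to ask for more or fewer, as in the cells below.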

In [24]:
df.head(n=100)

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume
0,1997-05-15,2.437500,2.500000,1.927083,1.958333,1.958333,72156000
1,1997-05-16,1.968750,1.979167,1.708333,1.729167,1.729167,14700000
2,1997-05-19,1.760417,1.770833,1.625000,1.708333,1.708333,6106800
3,1997-05-20,1.729167,1.750000,1.635417,1.635417,1.635417,5467200
4,1997-05-21,1.635417,1.645833,1.375000,1.427083,1.427083,18853200
...,...,...,...,...,...,...,...
95,1997-09-30,4.000000,4.348958,3.802083,4.338542,4.338542,5254800
96,1997-10-01,4.437500,4.500000,3.937500,4.020833,4.020833,4999200
97,1997-10-02,4.041667,4.177083,3.989583,4.010417,4.010417,1876800
98,1997-10-03,4.083333,4.125000,3.979167,4.015625,4.015625,1164000


In [28]:
df.tail(30)

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume
5822,2020-07-06,2934.969971,3059.879883,2930.0,3057.040039,3057.040039,6880600
5823,2020-07-07,3058.550049,3069.550049,2990.0,3000.120117,3000.120117,5257500
5824,2020-07-08,3022.610107,3083.969971,3012.429932,3081.110107,3081.110107,5037600
5825,2020-07-09,3115.98999,3193.879883,3074.0,3182.629883,3182.629883,6388700
5826,2020-07-10,3191.76001,3215.0,3135.699951,3200.0,3200.0,5486000
5827,2020-07-13,3251.060059,3344.290039,3068.389893,3104.0,3104.0,7720400
5828,2020-07-14,3089.0,3127.379883,2950.0,3084.0,3084.0,7231900
5829,2020-07-15,3080.22998,3098.350098,2973.179932,3008.870117,3008.870117,5788900
5830,2020-07-16,2971.060059,3032.0,2918.22998,2999.899902,2999.899902,6394200
5831,2020-07-17,3009.0,3024.0,2948.449951,2961.969971,2961.969971,4761300
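
To see how the `n` parameter changes the result, here is a small sketch on a made-up dataframe with 20 rows. The variable names are chosen for this example only.

```python
import pandas as pd

# A made-up dataframe with 20 rows, numbered 0 through 19.
demo_df = pd.DataFrame({"value": range(20)})

first_default = demo_df.head()   # no argument: the first 5 rows
first_three = demo_df.head(n=3)  # the first 3 rows
last_two = demo_df.tail(2)       # the last 2 rows (n can be passed without its name)

print(len(first_default), len(first_three), len(last_two))  # 5 3 2
print(last_two)
```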


Sometimes our top and bottom rows aren't very representative, and we'd prefer to look at a random sample of rows to get a better sense of the data. We can do this with `.sample()`. **Note** that we can supply the parameter `n` to specify how many rows we want to sample.

In [29]:
df.sample(n=20)

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume
5642,2019-10-16,1773.329956,1786.23999,1770.52002,1777.430054,1777.430054,2763400
2999,2009-04-17,76.779999,78.720001,75.879997,78.050003,78.050003,7426000
5553,2019-06-11,1883.25,1893.699951,1858.0,1863.699951,1863.699951,4042700
2761,2008-05-07,75.260002,76.639999,73.089996,73.18,73.18,8376600
3165,2009-12-11,136.070007,136.289993,133.199997,134.149994,134.149994,8046700
1394,2002-11-29,24.15,24.379999,23.33,23.35,23.35,2577300
2231,2006-03-29,35.689999,36.810001,35.310001,36.32,36.32,7199200
1324,2002-08-21,15.95,15.97,15.2,15.38,15.38,7284200
4311,2014-07-03,334.829987,338.299988,333.079987,337.48999,337.48999,1944300
5263,2018-04-16,1445.0,1447.0,1427.47998,1441.5,1441.5,2808600
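
Because `.sample()` is random, each run returns different rows. If you need the same sample every time (for example, to compare notes with a classmate), pandas offers a `random_state` parameter. It is not used elsewhere in this tutorial, so treat the sketch below as an optional extra; the data and the seed value 42 are arbitrary.

```python
import pandas as pd

# A made-up dataframe with 100 rows.
demo_df = pd.DataFrame({"value": range(100)})

# Using the same random_state makes the "random" sample repeatable.
sample_a = demo_df.sample(n=5, random_state=42)
sample_b = demo_df.sample(n=5, random_state=42)

print(sample_a)
print(sample_a.equals(sample_b))  # True
```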


#### Series

We can think of our dataframe as a collection of rows and columns, where each row represents an "observation" (sometimes referred to as a "record") and each column contains a specific type of information collected about each observation.

In Pandas, our columns are stored as what's called 'Series' objects, and our dataframes can be thought of as named collections of series.

We can extract a single column in a couple of ways:

- bracket notation: `df["Region"]`. This is the most robust way to refer to a Series.

- dot notation: `df.Region`. This is simpler and easier to read, but not always available.


In some cases, dot notation does not work! The most common situations are:

- The column name contains a space or other characters that are not allowed in Python names
- The column name is the same as an existing attribute or method (e.g., a column named "shape")

For example, in our Public Health dataframe, `df.Uninsured adults` doesn't work, because Python reads the space as the end of the name and does not understand "Uninsured adults" as a single value. Instead we'd use `df["Uninsured adults"]`.
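
The sketch below illustrates both situations on a made-up dataframe. The column names "Region" and "Uninsured adults" echo the examples above, but the values, and the extra column named "shape", are invented for illustration.

```python
import pandas as pd

demo_df = pd.DataFrame({
    "Region": ["South", "West"],
    "Uninsured adults": [0.21, 0.18],
    "shape": ["round", "square"],
})

# For a simple column name, dot and bracket notation return the same Series.
region_by_dot = demo_df.Region
region_by_bracket = demo_df["Region"]
print(region_by_dot.equals(region_by_bracket))  # True

# A column name with a space only works with bracket notation.
uninsured = demo_df["Uninsured adults"]
print(uninsured)

# A column named like an attribute is hidden by dot notation:
# demo_df.shape gives the dataframe's dimensions, not the column.
print(demo_df.shape)     # (2, 3)
print(demo_df["shape"])  # the actual column
```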

Series have their own set of attributes and methods just like dataframes. Some attributes like `.dtypes` and `.shape` are available for both.
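
As a quick sketch of a shared attribute, compare `.shape` on a dataframe and on one of its columns. The dataframe `demo_df` and its values are made up for this example.

```python
import pandas as pd

demo_df = pd.DataFrame({"Close": [1.5, 2.5, 3.5], "Volume": [10, 20, 30]})

# Extract one column as a Series.
close = demo_df["Close"]

# A dataframe has two dimensions: (rows, columns).
print(demo_df.shape)  # (3, 2)

# A Series has only one dimension, so its shape has a single number.
print(close.shape)  # (3,)

# For a Series, .dtypes gives the single data type of its values.
print(close.dtypes)  # float64
```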

In [41]:
print(df.shape)


(5852, 7)


In [2]:
print(df.column)

NameError: name 'df' is not defined

AttributeError: 'DataFrame' object has no attribute 'Region'

One of the most useful methods for categorical variables is `.value_counts()` which provides a frequency table.

In [47]:
df.Region.value_counts()

AttributeError: 'DataFrame' object has no attribute 'Region'
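
The call above only works when the dataframe actually has a column named "Region". To show what a frequency table looks like, here is a sketch on a made-up dataframe that does have one; the region labels are invented.

```python
import pandas as pd

# Made-up categorical data.
demo_df = pd.DataFrame(
    {"Region": ["South", "West", "South", "Midwest", "South", "West"]}
)

# value_counts() counts how often each value appears, most frequent first.
region_counts = demo_df["Region"].value_counts()
print(region_counts)

# The result is itself a Series, so individual counts can be looked up by label.
print(region_counts["South"])  # 3
```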

This can also be used on top of other attributes or methods that return series. For example, the code below shows how frequently each data type appears in our dataframe.

In [20]:
df.dtypes.value_counts()

float64    54
object      6
int64       4
dtype: int64
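
The same chaining can be tried on a small made-up dataframe, where the counts are easy to verify by eye. The names `demo_df`, `dtype_counts`, and `counts_by_name` are chosen for this sketch.

```python
import pandas as pd

# Two decimal columns and one whole-number column.
demo_df = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0], "c": [5, 6]})

# .dtypes returns a Series of data types, so .value_counts() can be chained onto it.
dtype_counts = demo_df.dtypes.value_counts()
print(dtype_counts)

# Converting the labels to text makes the counts easy to look up by name.
counts_by_name = {str(dtype): int(count) for dtype, count in dtype_counts.items()}
print(counts_by_name)
```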

For example, we might call up a value count of the series "State" to get a more granular sense of how our dataframe's records are distributed geographically.

In [21]:
df.State.value_counts()

TX    469
GA    318
VA    266
KY    240
MO    229
IL    204
NC    200
KS    199
IA    198
TN    190
IN    184
OH    176
MN    174
MI    164
MS    163
NE    157
OK    154
AR    150
WI    144
PA    134
FL    134
AL    134
LA    128
NY    124
CO    119
SD    117
CA    114
WV    110
MT     92
SC     92
ND     92
ID     84
WA     78
OR     67
NM     64
UT     54
MD     48
AK     46
WY     46
NJ     42
NV     32
ME     32
AZ     30
VT     28
MA     28
NH     20
CT     16
RI     10
HI      8
DE      6
DC      1
Name: State, dtype: int64

#### Now open the `.ipynb` files you created last time: 
1. import pandas and numpy
2. create a dataframe using `pd.read_csv`
3. start exploring your own data!

In [4]:
df.head()

NameError: name 'df' is not defined