# 🐼 Pandas for Data Analysis

Pandas is a Python library that helps you quickly interview, analyze, manipulate and process data. A Python library is a collection of functions and methods that lets you do the same work with less code.

For example, here is how you would read `.csv` data without pandas:

In [4]:
import csv

with open('data/avocado-short.csv', newline='') as csvfile:
    avoreader = csv.reader(csvfile)
    for row in avoreader:
        print(row)

['', 'Date', 'AveragePrice', 'Total Volume', '4046', '4225', '4770', 'Total Bags', 'Small Bags', 'Large Bags', 'XLarge Bags', 'type', 'year', 'region']
['0', '2015-12-27', '1.33', '64236.62', '1036.74', '54454.85', '48.16', '8696.87', '8603.62', '93.25', '0', 'conventional', '2015', 'Albany']
['1', '2015-12-20', '1.35', '54876.98', '674.28', '44638.81', '58.33', '9505.56', '9408.07', '97.49', '0', 'conventional', '2015', 'Albany']
['2', '2015-12-13', '0.93', '118220.22', '794.7', '109149.67', '130.5', '8145.35', '8042.21', '103.14', '0', 'conventional', '2015', 'Albany']
['3', '2015-12-06', '1.08', '78992.15', '1132', '71976.41', '72.58', '5811.16', '5677.4', '133.76', '0', 'conventional', '2015', 'Albany']
['4', '2015-11-29', '1.28', '51039.6', '941.48', '43838.39', '75.78', '6183.95', '5986.26', '197.69', '0', 'conventional', '2015', 'Albany']
['5', '2015-11-22', '1.26', '55979.78', '1184.27', '48067.99', '43.61', '6683.91', '6556.47', '127.44', '0', 'conventional', '2015', 'Albany']

And here's how you'd do it with pandas:

In [2]:
import pandas as pd

short_avo = pd.read_csv('data/avocado-short.csv')
print(short_avo)

    Unnamed: 0        Date  AveragePrice  Total Volume     4046       4225  \
0            0  2015-12-27          1.33      64236.62  1036.74   54454.85   
1            1  2015-12-20          1.35      54876.98   674.28   44638.81   
2            2  2015-12-13          0.93     118220.22   794.70  109149.67   
3            3  2015-12-06          1.08      78992.15  1132.00   71976.41   
4            4  2015-11-29          1.28      51039.60   941.48   43838.39   
5            5  2015-11-22          1.26      55979.78  1184.27   48067.99   
6            6  2015-11-15          0.99      83453.76  1368.92   73672.72   
7            7  2015-11-08          0.98     109428.33   703.75  101815.36   
8            8  2015-11-01          1.02      99811.42  1022.15   87315.57   
9            9  2015-10-25          1.07      74338.76   842.40   64757.44   
10          10  2015-10-18          1.12      84843.44   924.86   75595.85   
11          11  2015-10-11          1.28      64489.17  1582.03 

You can already see that pandas has saved us quite a bit of writing. It's also fast on large datasets and very well documented, so when you run into issues, you'll be able to find your way out pretty quickly. And believe you me... you will absolutely run into issues. Because even people who seem to know what they're talking about (✋) need to google the heck out of things sometimes.
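
Less typing is not the only difference between the two approaches. The sketch below uses a made-up two-row sample (an assumption standing in for `data/avocado-short.csv`) read from an in-memory string, and shows that `csv.reader` hands back every value as a string, while `pd.read_csv` infers a numeric type for the price column.

```python
import csv
import io

import pandas as pd

# Hypothetical two-row sample standing in for data/avocado-short.csv
raw = "Date,AveragePrice,region\n2015-12-27,1.33,Albany\n2015-12-20,1.35,Albany\n"

# csv.reader yields lists of strings; converting types is left to you
rows = list(csv.reader(io.StringIO(raw)))
header, first = rows[0], rows[1]
print(type(first[1]))  # <class 'str'>

# pandas infers a float dtype for AveragePrice while reading
df = pd.read_csv(io.StringIO(raw))
print(df.dtypes)
print(df["AveragePrice"].mean())
```

Because the price column is already numeric, calculations like `.mean()` work right away, with no manual `float()` conversions.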

## Interviewing Data

One of the first things you're going to want to do with a dataset is figure out what you've got. We do this by using different pandas functions that allow us to "interview" or get an overview of the dataset.

The data we used above, `data/avocado-short.csv`, is fun, but let's move on to a dataset a little more applicable to what we might actually use pandas for.

In [46]:
#nursing home facility data
nfacs = pd.read_csv('data/facilities.csv')

### Interview Functions

`df.head()` - get the first 5 rows of your data

`df.tail()` - get the last 5 rows of your data

`df.sample(5)` - get a random sample of 5 rows of your data

`df.columns` - get the column labels (an attribute, so no parentheses)

`df.info()` - get the number of non-null values and the data type for each column

`df.shape` - get the number of rows and columns as a tuple (also an attribute)

`df.describe()` - get summary statistics (count, mean, std, min, quartiles, max) for each numeric column

Let's take these functions for a spin:

In [13]:
nfacs.head()

Unnamed: 0,facid,fac_type,capacity,fac_name,fac_address,city_state_zip,Unnamed: 6,owner,operator
0,385008,NF,96.0,Presbyterian Community Care Center,1085 N Oregon St,"Ontario, OR 97914",,"Presbyterian Nursing Home, Inc.","Presbyterian Nursing Home, Inc."
1,385010,NF,159.0,Laurelhurst Village Rehabilitation Center,3060 SE Stark St,"Portland, OR 97214",,"Laurelhurst Operations, LLC","Laurelhurst Operations, LLC"
2,385015,NF,128.0,Regency Gresham Nursing & Rehabilitation Center,5905 SE Powell Valley Rd,"Gresham, OR 97080",,Regency Gresham Nursing & Rehabilitation Cente...,"Regency Pacific Management, LLC"
3,385018,NF,98.0,Providence Benedictine Nursing Center,540 South Main St,"Mt. Angel, OR 97362",,Providence Health & Services - Oregon,Providence Health & Services - Oregon
4,385024,NF,91.0,Avamere Health Services of Rogue Valley,625 Stevens St,"Medford, OR 97504",,"Medford Operations, LLC","Medford Operations, LLC"


In [14]:
nfacs.tail()

Unnamed: 0,facid,fac_type,capacity,fac_name,fac_address,city_state_zip,Unnamed: 6,owner,operator
639,70M258,ALF,70.0,Avamere Living at St. Helens,2400 Gable Rd.,"St. Helens, OR 97051",,"Avamere - St. Helens Operations, LLC","Avamere-St.Helens Operations, LLC"
640,70M313,ALF,84.0,"Springs at Tanasbourne II, LLC",1950 NW 192nd Avenue,"Hillsboro, OR 97124",,"Springs at Tanasbourne II, LLC","The Springs Living, LLC"
641,70M350,ALF,119.0,"Village at Keizer Ridge, The",1165 McGee Court,"Keizer, OR 97303",,"VKR, LLC","Keizer Care Properties, LLC"
642,7MU215,ALF,126.0,St. Anthony Village,3560 SE 79th Avenue,"Portland, OR 97206",,St. Anthony Village Associates LP,SAGE
643,0O0O0O,,57.0,Fake Facility,1234 Fake St,"Nowheresville, NY 05400",,Fake Company,"Not a Company, LLC"


In [15]:
nfacs.sample(5)

Unnamed: 0,facid,fac_type,capacity,fac_name,fac_address,city_state_zip,Unnamed: 6,owner,operator
129,3.80E+189,NF,80.0,Gracelen Terrace Long Term Care Facility,10948 SE Boise St,"Portland, OR 97266",,"H & L Care Centers, Inc.","H & L Care Centers, Inc."
333,50R380,RCF,20.0,Countryside Living South,406 NW 2nd Avenue,"Canby, OR 97013",,"Countryside Living of Canby, LLC","Countryside Living of Canby, LLC"
506,70M008,ALF,49.0,Awbrey Place,2825 Neff Rd,"Bend, OR 97701",,"AWBREY AID OPCO, LLC","AWBREY AID OPCO, LLC"
156,50M055,RCF,88.0,Brookdale Mt. Hood,25200 SE Stark St,"Gresham, OR 97030",,"Brookdale Senior Living Communities, Inc.","Brookdale Senior Living Communities, Inc."
241,50R271,RCF,21.0,Willamette View Memory Care Community,13145 SE River Rd,"Portland, OR 97222",,"Willamette View, Inc.","Willamette View, Inc."


In [16]:
nfacs.columns

Index(['facid', 'fac_type', 'capacity', 'fac_name', 'fac_address',
       'city_state_zip', 'Unnamed: 6', 'owner', 'operator'],
      dtype='object')

In [17]:
nfacs.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 644 entries, 0 to 643
Data columns (total 9 columns):
facid             644 non-null object
fac_type          642 non-null object
capacity          642 non-null float64
fac_name          644 non-null object
fac_address       644 non-null object
city_state_zip    644 non-null object
Unnamed: 6        0 non-null float64
owner             644 non-null object
operator          644 non-null object
dtypes: float64(2), object(7)
memory usage: 45.4+ KB


In [19]:
nfacs.shape

(644, 9)

In [22]:
nfacs.describe()

Unnamed: 0,capacity,Unnamed: 6
count,642.0,0.0
mean,57.551402,
std,34.196204,
min,5.0,
25%,30.0,
50%,53.0,
75%,79.0,
max,214.0,


You might notice that we only got back information on two of our columns here. By default, `df.describe()` summarizes only numeric columns; pass `include='all'` if you want the text columns summarized too.
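
To see what `describe()` does with text columns, here is a minimal sketch on a made-up four-row frame (the `mini` DataFrame and its values are assumptions for illustration, not the real facilities data). With `include="all"`, text columns get `count`, `unique`, `top` and `freq` rows alongside the numeric statistics.

```python
import pandas as pd

# Hypothetical miniature of the facilities table
mini = pd.DataFrame({
    "fac_type": ["NF", "ALF", "ALF", None],
    "capacity": [96.0, 70.0, 84.0, 57.0],
})

# Default behavior: only the numeric column is summarized
numeric_summary = mini.describe()
print(numeric_summary)

# include="all" adds count/unique/top/freq for the text column
full_summary = mini.describe(include="all")
print(full_summary)
```

Statistics that do not apply to a column (a mean for `fac_type`, say) show up as `NaN` in the combined table.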

## Asking Questions
At this point, we can start asking questions of our data. 

- Which facility has the largest capacity?
- What is the average/max/min capacity for all facilities?
- How many facilities of each type are there?
- How many facilities does each state have?

### Which facility has the largest capacity?

In [24]:
nfacs.sort_values('capacity', ascending=False).head()

Unnamed: 0,facid,fac_type,capacity,fac_name,fac_address,city_state_zip,Unnamed: 6,owner,operator
93,385240,NF,214.0,Marian Estates,390 Church St,"Sublimity, OR 97385",,"Ernmaur, Inc.",Marian Estates Support Services
259,50R293,RCF,186.0,Miramont Pointe,11520 SE Sunnyside Rd,"Clackamas, OR 97015",,"MP, LLC","MP, LLC"
571,70M080,ALF,180.0,Rose Schnitzer Manor,6140 SW Boundary St,"Portland, OR 97221",,Robison Jewish Home,Robison Jewish Home
20,385112,NF,180.0,West Hills Health & Rehabilitation Center,5701 SW Multnomah Blvd,"Portland, OR 97219",,West Hills Convalescent Center Limited Partner...,West Hills Convalescent Center Limited Partner...
50,385166,NF,165.0,Maryville Nursing Home,14645 SW Farmington Rd,"Beaverton, OR 97007",,Sisters of St. Mary of Oregon Maryville Corp.,Sisters of St. Mary of Oregon Maryville Corp.
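
Sorting the whole table is one way to find the biggest facility. As a sketch of two alternatives, `nlargest()` returns the top rows directly and `idxmax()` returns the index label of the single largest value. The three-row `mini` frame is a hypothetical stand-in built from names and capacities that appear in the outputs above.

```python
import pandas as pd

# Hypothetical three-row stand-in for nfacs
mini = pd.DataFrame({
    "fac_name": ["Awbrey Place", "Marian Estates", "Miramont Pointe"],
    "capacity": [49.0, 214.0, 186.0],
})

# Top two rows by capacity, largest first
top_two = mini.nlargest(2, "capacity")
print(top_two)

# Index label of the largest capacity, then the full row for it
biggest = mini.loc[mini["capacity"].idxmax()]
print(biggest["fac_name"])
```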


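### What is the average/max/min capacity for all facilities?

Note that the first cell below uses `.median()`, which returns the middle value (the 50% row in the `describe()` output above), not the arithmetic mean. For the mean, use `nfacs.capacity.mean()`; `describe()` reported it as about 57.55.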

In [25]:
nfacs.capacity.median()

53.0

In [26]:
nfacs.capacity.max()

214.0

In [27]:
nfacs.capacity.min()

5.0
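
If you want several of these numbers at once, `Series.agg()` accepts a list of function names. The sketch below runs on five made-up capacity values (an assumption for illustration) and also shows how a single large value pulls the mean above the median.

```python
import pandas as pd

# Five made-up capacities; one large value skews the mean upward
capacity = pd.Series([5.0, 30.0, 53.0, 79.0, 214.0], name="capacity")

# One call, four statistics, returned as a Series indexed by function name
stats = capacity.agg(["mean", "median", "min", "max"])
print(stats)
```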

### How many facilities of each type are there?

In [30]:
nfacs.groupby('fac_type').facid.agg('count')

fac_type
ALF    221
NF     137
RCF    284
Name: facid, dtype: int64
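
The counts above add up to 642, not 644, because rows with a missing `fac_type` are dropped from the groups. As a sketch of another way to count, `value_counts()` does the same tally, and `dropna=False` keeps the missing values visible. The five-value Series here is made up for illustration.

```python
import pandas as pd

# Hypothetical sample of facility types, including one missing value
fac_type = pd.Series(["NF", "ALF", "RCF", "ALF", None], name="fac_type")

# Missing values are left out by default
counts = fac_type.value_counts()
print(counts)

# dropna=False adds a row counting the missing values
counts_with_missing = fac_type.value_counts(dropna=False)
print(counts_with_missing)
```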

### How many facilities does each state have?
To answer this, we need to separate the state from the `city_state_zip` column. Since Excel was my first analysis tool, I always think "How would I do this in Excel?" first. I would probably write an Excel function that first separates out the state and the zip, and then break that new column apart at the space. We can do that in Python too!

In [47]:
# take the last 8 characters (e.g. "OR 97914"), split at the space, keep the first piece (the state)
nfacs['state'] = (nfacs['city_state_zip'].str[-8:]).str.split(" ").str[0]
nfacs.head()

Unnamed: 0,facid,fac_type,capacity,fac_name,fac_address,city_state_zip,Unnamed: 6,owner,operator,state
0,385008,NF,96.0,Presbyterian Community Care Center,1085 N Oregon St,"Ontario, OR 97914",,"Presbyterian Nursing Home, Inc.","Presbyterian Nursing Home, Inc.",OR
1,385010,NF,159.0,Laurelhurst Village Rehabilitation Center,3060 SE Stark St,"Portland, OR 97214",,"Laurelhurst Operations, LLC","Laurelhurst Operations, LLC",OR
2,385015,NF,128.0,Regency Gresham Nursing & Rehabilitation Center,5905 SE Powell Valley Rd,"Gresham, OR 97080",,Regency Gresham Nursing & Rehabilitation Cente...,"Regency Pacific Management, LLC",OR
3,385018,NF,98.0,Providence Benedictine Nursing Center,540 South Main St,"Mt. Angel, OR 97362",,Providence Health & Services - Oregon,Providence Health & Services - Oregon,OR
4,385024,NF,91.0,Avamere Health Services of Rogue Valley,625 Stevens St,"Medford, OR 97504",,"Medford Operations, LLC","Medford Operations, LLC",OR


Now we can use the `df.groupby()` method from the previous step to see how many facilities are in each state in our dataset.

In [48]:
nfacs.groupby('state').facid.agg('count')

state
NY      1
OR    643
Name: facid, dtype: int64
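
The slicing approach relies on every entry ending in a two-letter state, a space and a five-digit zip. As a sketch of a more forgiving alternative, `str.extract()` with a regular expression pulls out the two capital letters that follow the comma. The ZIP+4 entry below is a hypothetical value added to show where the slice breaks; it is not in the source data.

```python
import pandas as pd

addresses = pd.Series([
    "Ontario, OR 97914",
    "Nowheresville, NY 05400",
    "Portland, OR 97214-1234",  # hypothetical ZIP+4 value, not in the source data
])

# Original approach: last 8 characters, split at the space, keep the first piece
sliced = addresses.str[-8:].str.split(" ").str[0]
print(sliced)

# Regex approach: capture two capital letters between the comma and the zip
extracted = addresses.str.extract(r",\s*([A-Z]{2})\s+\d{5}", expand=False)
print(extracted)
```

Both approaches agree on the first two rows; only the regex version still returns a state for the longer zip.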

### More than one way to peel a potato 🐈👍
I think it's important at this point to let you know that there are often a variety of different ways to do any one thing. Pandas is like life, y'all. 

In your travels, you might find that you need to use [`.eval()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.eval.html) to compute a new value. That's fine. Depending on what you need to do, maybe that's better. 

Or maybe you need [`.apply()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html) or [`.map()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.map.html).

As you gain more experience, and deal with more error messages, you'll learn when you need which function.
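
To make those three a little less abstract, here is a minimal sketch of each on a made-up frame. The `mini` DataFrame, the new column names, the 50-bed cutoff and the expansions of the facility type codes are all assumptions for illustration, not taken from the dataset.

```python
import pandas as pd

# Hypothetical three-row frame
mini = pd.DataFrame({
    "fac_type": ["NF", "ALF", "RCF"],
    "capacity": [96.0, 70.0, 20.0],
})

# .map(): look up each value in a dictionary (expansions are assumed)
type_names = {
    "NF": "Nursing Facility",
    "ALF": "Assisted Living Facility",
    "RCF": "Residential Care Facility",
}
mini["type_name"] = mini["fac_type"].map(type_names)

# .apply(): run a function on each value (the 50-bed cutoff is arbitrary)
mini["size"] = mini["capacity"].apply(lambda beds: "large" if beds > 50 else "small")

# .eval(): compute a new column from a string expression; returns a new DataFrame
mini = mini.eval("double_capacity = capacity * 2")
print(mini)
```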

## Filtering
One way to filter your dataset is by grabbing only certain values that are **equal to** some other value. 

Let's get rid of that one NY record:

In [52]:
or_nfacs = nfacs[nfacs['state'] == 'OR']
or_nfacs.state.unique()

array(['OR'], dtype=object)

You can also filter on a sliding scale of comparison.

Let's grab all of the facilities that have a capacity of more than 10 beds:

In [None]:
large_nfacs = nfacs[nfacs['capacity'] > 10]
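
Comparison filters follow the same pattern as the equality filter: build a True/False mask, then index with it. The sketch below uses a made-up four-row frame (all names and values are assumptions) to show a greater-than filter, two conditions combined with `&`, and the same combined filter written with `query()`.

```python
import pandas as pd

# Hypothetical four-row frame; one capacity is missing
mini = pd.DataFrame({
    "fac_name": ["A", "B", "C", "D"],
    "state": ["OR", "OR", "NY", "OR"],
    "capacity": [96.0, 8.0, 57.0, None],
})

# Rows with more than 10 beds; a missing capacity never passes the comparison
over_ten = mini[mini["capacity"] > 10]
print(over_ten)

# Combine conditions with & and wrap each one in parentheses
or_over_ten = mini[(mini["state"] == "OR") & (mini["capacity"] > 10)]
print(or_over_ten)

# The same filter as a query string
same_with_query = mini.query("state == 'OR' and capacity > 10")
print(same_with_query)
```

The parentheses matter: `&` binds more tightly than `==` and `>`, so leaving them out changes what gets compared.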