# 🐼 Pandas for Data Analysis

Pandas is a Python library that helps you quickly interview, analyze, manipulate, and process data. A Python library is a collection of functions and methods that lets you write less code to do the same thing.

For example, here is how you would read `.csv` data without pandas:

In [4]:
import csv

with open('data/avocado-short.csv', newline='') as csvfile:
    avoreader = csv.reader(csvfile)
    for row in avoreader:
        print(row)

['', 'Date', 'AveragePrice', 'Total Volume', '4046', '4225', '4770', 'Total Bags', 'Small Bags', 'Large Bags', 'XLarge Bags', 'type', 'year', 'region']
['0', '2015-12-27', '1.33', '64236.62', '1036.74', '54454.85', '48.16', '8696.87', '8603.62', '93.25', '0', 'conventional', '2015', 'Albany']
['1', '2015-12-20', '1.35', '54876.98', '674.28', '44638.81', '58.33', '9505.56', '9408.07', '97.49', '0', 'conventional', '2015', 'Albany']
['2', '2015-12-13', '0.93', '118220.22', '794.7', '109149.67', '130.5', '8145.35', '8042.21', '103.14', '0', 'conventional', '2015', 'Albany']
['3', '2015-12-06', '1.08', '78992.15', '1132', '71976.41', '72.58', '5811.16', '5677.4', '133.76', '0', 'conventional', '2015', 'Albany']
['4', '2015-11-29', '1.28', '51039.6', '941.48', '43838.39', '75.78', '6183.95', '5986.26', '197.69', '0', 'conventional', '2015', 'Albany']
['5', '2015-11-22', '1.26', '55979.78', '1184.27', '48067.99', '43.61', '6683.91', '6556.47', '127.44', '0', 'conventional', '2015', 'Albany']

And here's how you'd do it with pandas:

In [2]:
import pandas as pd

short_avo = pd.read_csv('data/avocado-short.csv')
print(short_avo)

    Unnamed: 0        Date  AveragePrice  Total Volume     4046       4225  \
0            0  2015-12-27          1.33      64236.62  1036.74   54454.85   
1            1  2015-12-20          1.35      54876.98   674.28   44638.81   
2            2  2015-12-13          0.93     118220.22   794.70  109149.67   
3            3  2015-12-06          1.08      78992.15  1132.00   71976.41   
4            4  2015-11-29          1.28      51039.60   941.48   43838.39   
5            5  2015-11-22          1.26      55979.78  1184.27   48067.99   
6            6  2015-11-15          0.99      83453.76  1368.92   73672.72   
7            7  2015-11-08          0.98     109428.33   703.75  101815.36   
8            8  2015-11-01          1.02      99811.42  1022.15   87315.57   
9            9  2015-10-25          1.07      74338.76   842.40   64757.44   
10          10  2015-10-18          1.12      84843.44   924.86   75595.85   
11          11  2015-10-11          1.28      64489.17  1582.03 

You can already see that pandas has saved us quite a bit of writing. It's also super fast for large datasets, and it's very well documented, so when you run into issues you'll be able to find your way out pretty quickly. And believe you me... you will absolutely run into issues. Even people who seem to know what they're talking about (✋) need to google the heck out of things sometimes.
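There's one more difference hiding in the two outputs above: `csv.reader` hands back every value as a string, while `read_csv` works out the column types for you. Here's a minimal sketch of that, using the header and first two rows from the output above. The `io.StringIO` buffer and the variable names are stand-ins for illustration so the example runs without the data file.

```python
import csv
import io

import pandas as pd

# Header plus the first two data rows from the avocado-short output above,
# held in memory so the example does not need the file on disk.
raw = (
    ",Date,AveragePrice,Total Volume,4046,4225,4770,Total Bags,"
    "Small Bags,Large Bags,XLarge Bags,type,year,region\n"
    "0,2015-12-27,1.33,64236.62,1036.74,54454.85,48.16,8696.87,"
    "8603.62,93.25,0,conventional,2015,Albany\n"
    "1,2015-12-20,1.35,54876.98,674.28,44638.81,58.33,9505.56,"
    "9408.07,97.49,0,conventional,2015,Albany\n"
)

# csv.reader: every field comes back as a string.
rows = list(csv.reader(io.StringIO(raw)))
price_from_csv = rows[1][2]  # the string '1.33'

# pandas: column types are inferred while reading.
sample_df = pd.read_csv(io.StringIO(raw))
price_from_pandas = sample_df.loc[0, "AveragePrice"]  # a floating-point number

print(type(price_from_csv).__name__, type(price_from_pandas).__name__)
print(sample_df.columns[0])
```

The blank first header cell is also why the pandas output shows a column called `Unnamed: 0`: pandas makes up a name when the header is empty.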

## Interviewing Data

One of the first things you're going to want to do with a dataset is figure out what you've got. We do this by using different pandas functions that allow us to "interview" or get an overview of the dataset.

The data we used above, `data/avocado-short.csv`, is a 40-row slice of the full `data/avocado.csv` dataset that we'll be using here today. We used the short version above just so we wouldn't have to scroll for 9 years to get to the end of the data.

In [3]:
avo_data = pd.read_csv('data/avocado.csv')

### Interview Functions

`df.head()` - get the first 5 rows of your data

`df.tail()` - get the last 5 rows of your data

`df.sample(5)` - get a random sample of 5 rows of your data

`df.columns` - get the column labels (this is an attribute, so no parentheses)

`df.info()` - get the number of non-null values and the data type for each column

`df.shape` - get the number of rows and columns as a `(rows, columns)` tuple (also an attribute)

`df.describe()` - get summary statistics (count, mean, standard deviation, min, quartiles, max) for each numeric column

Let's take these functions for a spin:

In [4]:
avo_data.head()

| | Unnamed: 0 | Date | AveragePrice | Total Volume | 4046 | 4225 | 4770 | Total Bags | Small Bags | Large Bags | XLarge Bags | type | year | region |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2015-12-27 | 1.33 | 64236.62 | 1036.74 | 54454.85 | 48.16 | 8696.87 | 8603.62 | 93.25 | 0.0 | conventional | 2015 | Albany |
| 1 | 1 | 2015-12-20 | 1.35 | 54876.98 | 674.28 | 44638.81 | 58.33 | 9505.56 | 9408.07 | 97.49 | 0.0 | conventional | 2015 | Albany |
| 2 | 2 | 2015-12-13 | 0.93 | 118220.22 | 794.7 | 109149.67 | 130.5 | 8145.35 | 8042.21 | 103.14 | 0.0 | conventional | 2015 | Albany |
| 3 | 3 | 2015-12-06 | 1.08 | 78992.15 | 1132.0 | 71976.41 | 72.58 | 5811.16 | 5677.4 | 133.76 | 0.0 | conventional | 2015 | Albany |
| 4 | 4 | 2015-11-29 | 1.28 | 51039.6 | 941.48 | 43838.39 | 75.78 | 6183.95 | 5986.26 | 197.69 | 0.0 | conventional | 2015 | Albany |


In [5]:
avo_data.tail()

| | Unnamed: 0 | Date | AveragePrice | Total Volume | 4046 | 4225 | 4770 | Total Bags | Small Bags | Large Bags | XLarge Bags | type | year | region |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 18244 | 7 | 2018-02-04 | 1.63 | 17074.83 | 2046.96 | 1529.2 | 0.0 | 13498.67 | 13066.82 | 431.85 | 0.0 | organic | 2018 | WestTexNewMexico |
| 18245 | 8 | 2018-01-28 | 1.71 | 13888.04 | 1191.7 | 3431.5 | 0.0 | 9264.84 | 8940.04 | 324.8 | 0.0 | organic | 2018 | WestTexNewMexico |
| 18246 | 9 | 2018-01-21 | 1.87 | 13766.76 | 1191.92 | 2452.79 | 727.94 | 9394.11 | 9351.8 | 42.31 | 0.0 | organic | 2018 | WestTexNewMexico |
| 18247 | 10 | 2018-01-14 | 1.93 | 16205.22 | 1527.63 | 2981.04 | 727.01 | 10969.54 | 10919.54 | 50.0 | 0.0 | organic | 2018 | WestTexNewMexico |
| 18248 | 11 | 2018-01-07 | 1.62 | 17489.58 | 2894.77 | 2356.13 | 224.53 | 12014.15 | 11988.14 | 26.01 | 0.0 | organic | 2018 | WestTexNewMexico |
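Five rows is only the default. Both `head()` and `tail()` take an optional row count, which is handy when five rows is too few (or too many) to get a feel for the data. A quick sketch on a made-up frame (the `toy` name and its values are invented for illustration):

```python
import pandas as pd

# A small stand-in frame; the values are made up for illustration.
toy = pd.DataFrame({
    "week": range(10),
    "price": [1.0 + 0.1 * i for i in range(10)],
})

first_three = toy.head(3)  # rows with week 0, 1, 2
last_two = toy.tail(2)     # rows with week 8, 9

print(first_three)
print(last_two)
```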


In [6]:
avo_data.sample(5)

| | Unnamed: 0 | Date | AveragePrice | Total Volume | 4046 | 4225 | 4770 | Total Bags | Small Bags | Large Bags | XLarge Bags | type | year | region |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 13437 | 48 | 2016-01-24 | 2.09 | 28121.46 | 2289.5 | 9484.63 | 2221.2 | 14126.13 | 7033.2 | 7092.93 | 0.0 | organic | 2016 | NewYork |
| 6177 | 31 | 2017-05-28 | 1.24 | 182566.46 | 81187.49 | 20819.14 | 1809.22 | 78750.61 | 56749.77 | 21930.47 | 70.37 | conventional | 2017 | Columbus |
| 17475 | 31 | 2017-05-28 | 1.69 | 1302205.55 | 163384.37 | 357431.2 | 3195.15 | 778194.83 | 497848.02 | 280346.81 | 0.0 | organic | 2017 | TotalUS |
| 1466 | 10 | 2015-10-18 | 0.97 | 1856337.85 | 23873.63 | 1598365.27 | 1173.01 | 232925.94 | 181854.71 | 51069.15 | 2.08 | conventional | 2015 | NewYork |
| 6662 | 39 | 2017-04-02 | 1.26 | 166588.45 | 11375.5 | 61512.67 | 1452.17 | 92248.11 | 45777.12 | 46455.38 | 15.61 | conventional | 2017 | Indianapolis |
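Because `sample()` is random, you'll get different rows every time you run that cell. If you want the same rows each time (say, so a teammate sees what you saw), you can pass a `random_state` seed. A small sketch on a made-up frame, where the seed value 42 is an arbitrary choice:

```python
import pandas as pd

# A stand-in frame with 100 rows; the values are made up for illustration.
toy = pd.DataFrame({"week": range(100)})

# Same seed, same rows on every run.
pick_a = toy.sample(5, random_state=42)
pick_b = toy.sample(5, random_state=42)
print(pick_a.equals(pick_b))

# frac= samples a fraction of the rows instead of a fixed count.
one_tenth = toy.sample(frac=0.1, random_state=42)
print(len(one_tenth))
```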


In [7]:
avo_data.columns

Index(['Unnamed: 0', 'Date', 'AveragePrice', 'Total Volume', '4046', '4225',
       '4770', 'Total Bags', 'Small Bags', 'Large Bags', 'XLarge Bags', 'type',
       'year', 'region'],
      dtype='object')
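Knowing the exact column labels matters as soon as you want to pull out a single column. Labels with spaces (like `Total Volume`) need bracket access, and the PLU columns are labeled with the *string* `'4046'`, not the number. A sketch on a made-up three-column frame (the `toy` frame is only for illustration; its values come from the output above):

```python
import pandas as pd

toy = pd.DataFrame({
    "Total Volume": [64236.62, 54876.98],
    "4046": [1036.74, 674.28],
    "year": [2015, 2015],
})

years = toy.year              # attribute access works for simple names
volume = toy["Total Volume"]  # names with spaces need brackets
plu_4046 = toy["4046"]        # the label is the string "4046"

# The integer 4046 is not a column label, so this is False.
has_int_label = 4046 in toy.columns
print(has_int_label, "4046" in toy.columns)
```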

In [8]:
avo_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18249 entries, 0 to 18248
Data columns (total 14 columns):
Unnamed: 0      18249 non-null int64
Date            18249 non-null object
AveragePrice    18249 non-null float64
Total Volume    18249 non-null float64
4046            18249 non-null float64
4225            18249 non-null float64
4770            18249 non-null float64
Total Bags      18249 non-null float64
Small Bags      18249 non-null float64
Large Bags      18249 non-null float64
XLarge Bags     18249 non-null float64
type            18249 non-null object
year            18249 non-null int64
region          18249 non-null object
dtypes: float64(9), int64(2), object(3)
memory usage: 1.9+ MB
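One thing `info()` tells us here: `Date` has dtype `object`, which means pandas read it as plain text rather than as dates. If we later want to sort by date or pull out the month, one option is converting it with `pd.to_datetime`. A sketch using three dates from the output above (the `toy` frame is only for illustration):

```python
import pandas as pd

toy = pd.DataFrame({"Date": ["2015-12-27", "2015-12-20", "2015-12-13"]})

# Convert the text column to real datetimes.
toy["Date"] = pd.to_datetime(toy["Date"])

earliest = toy["Date"].min()
months = toy["Date"].dt.month  # the .dt accessor only works on datetime columns

print(toy["Date"].dtype)
print(earliest, list(months))
```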


In [9]:
avo_data.shape

(18249, 14)
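Since `shape` is just a tuple of `(rows, columns)`, you can unpack it into two variables when you need the numbers separately. A tiny sketch on a made-up frame:

```python
import pandas as pd

# A stand-in frame; the values are made up for illustration.
toy = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

n_rows, n_cols = toy.shape
print(f"{n_rows} rows x {n_cols} columns")

# len() on a DataFrame gives the row count as well.
print(len(toy))
```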

In [10]:
avo_data.describe()

| | Unnamed: 0 | AveragePrice | Total Volume | 4046 | 4225 | 4770 | Total Bags | Small Bags | Large Bags | XLarge Bags | year |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 18249.0 | 18249.0 | 18249.0 | 18249.0 | 18249.0 | 18249.0 | 18249.0 | 18249.0 | 18249.0 | 18249.0 | 18249.0 |
| mean | 24.232232 | 1.405978 | 850644.0 | 293008.4 | 295154.6 | 22839.74 | 239639.2 | 182194.7 | 54338.09 | 3106.426507 | 2016.147899 |
| std | 15.481045 | 0.402677 | 3453545.0 | 1264989.0 | 1204120.0 | 107464.1 | 986242.4 | 746178.5 | 243966.0 | 17692.894652 | 0.939938 |
| min | 0.0 | 0.44 | 84.56 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2015.0 |
| 25% | 10.0 | 1.1 | 10838.58 | 854.07 | 3008.78 | 0.0 | 5088.64 | 2849.42 | 127.47 | 0.0 | 2015.0 |
| 50% | 24.0 | 1.37 | 107376.8 | 8645.3 | 29061.02 | 184.99 | 39743.83 | 26362.82 | 2647.71 | 0.0 | 2016.0 |
| 75% | 38.0 | 1.66 | 432962.3 | 111020.2 | 150206.9 | 6243.42 | 110783.4 | 83337.67 | 22029.25 | 132.5 | 2017.0 |
| max | 52.0 | 3.25 | 62505650.0 | 22743620.0 | 20470570.0 | 2546439.0 | 19373130.0 | 13384590.0 | 5719097.0 | 551693.65 | 2018.0 |
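Notice that `Date`, `type`, and `region` are missing from that table. By default, `describe()` only summarizes numeric columns. To interview the text columns too, you can call `describe()` on a single column or pass `include="all"`. A sketch on a made-up frame (the `toy` frame and its four rows are only for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    "AveragePrice": [1.33, 1.35, 2.09, 1.24],
    "type": ["conventional", "conventional", "organic", "conventional"],
    "region": ["Albany", "Albany", "NewYork", "Columbus"],
})

# Default: only the numeric column shows up.
numeric_summary = toy.describe()

# A text column gets count, unique, top (most common value), and freq.
type_summary = toy["type"].describe()

# include="all" covers every column; statistics that do not apply are NaN.
full_summary = toy.describe(include="all")

print(numeric_summary)
print(type_summary)
print(full_summary)
```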


### Asking Questions

Armed with all this knowledge, we can start asking questions!

One of my first questions might be... what do all the columns mean? Thankfully, this dataset comes with a [data dictionary](https://www.kaggle.com/neuromusic/avocado-prices/home).

`Date` - the date of the observation

`AveragePrice` - the average price of a single avocado

`Total Volume` - total number of avocados sold

`4046` - total number of avocados with PLU 4046 sold

`4225` - total number of avocados with PLU 4225 sold

`4770` - total number of avocados with PLU 4770 sold

`type` - conventional or organic

`year` - the year

`region` - the city or region of the observation

*We seem to be missing definitions for columns `[Total Bags, Small Bags, Large Bags, XLarge Bags]`, but we can probably deduce what those mean on our own.*
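We don't have to take our deductions on faith, either. A reasonable guess is that `Total Bags` is the sum of the three bag sizes, and we can check that guess against the data. Here's a sketch using the first three rows from the `head()` output above. It's only a consistency check on a few rows, not confirmation of the official definitions.

```python
import numpy as np
import pandas as pd

# Bag columns from the first three rows of the head() output above.
bags = pd.DataFrame({
    "Total Bags": [8696.87, 9505.56, 8145.35],
    "Small Bags": [8603.62, 9408.07, 8042.21],
    "Large Bags": [93.25, 97.49, 103.14],
    "XLarge Bags": [0.0, 0.0, 0.0],
})

# Add the three sizes together, row by row.
size_sum = bags[["Small Bags", "Large Bags", "XLarge Bags"]].sum(axis=1)

# np.isclose compares with a small tolerance, since floats rarely match exactly.
matches = np.isclose(size_sum, bags["Total Bags"])
print(matches.all())
```

On the full dataset, the same check would run on `avo_data` instead of the hand-typed `bags` frame.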
