# Intro to Pandas
[pandas](https://pandas.pydata.org/) is the primary Python library for data analysis. If you are a data scientist, you will spend much of your time manipulating data in pandas. pandas builds on NumPy to make data analysis much easier. In particular, its primary data structure, the **DataFrame**, provides labels for both the rows and the columns, which makes it much easier to access the elements within.

### Iowa Housing Dataset
We will use the [Iowa Housing Dataset](https://www.kaggle.com/c/house-prices-advanced-regression-techniques) from Kaggle, a dataset of ~1500 homes in Ames, Iowa, with fields such as the sale price, floor area, number of rooms, and sale date. Make sure to check out the [data description file](datasets/iowa_housing_data_description.txt) to get an idea of the kinds of fields included in the dataset.

In [None]:
%matplotlib inline
import matplotlib
matplotlib.rcParams['figure.figsize'] = (9.0, 5.0)

import matplotlib.pyplot as plt
# It's a convention to import pandas as pd
import pandas as pd

pd.set_option('display.width', 1000)

In [None]:
# Create a DataFrame by reading in a csv file
data = pd.read_csv('datasets/iowa_housing.csv')
# Like NumPy arrays, DataFrames have a 'shape' (# of rows, # of columns)
print(data.shape)

# Show the first few rows of the DataFrame
data.head()

# Components of the DataFrame
The vast majority of an analysis takes place inside a DataFrame. A DataFrame has three components: the **index**, the **columns**, and the **data** (or **values**). The index labels the rows, the column names label the columns, and the data are the actual values that we manipulate during an analysis.

![anatomy](dataframe_anatomy.png)

In [None]:
# There are 3 main components of a DataFrame
#   1. The index - tells us how to locate ("index into") rows (axis 0).
#   2. The columns - tell us how to locate ("index into") columns (axis 1).
#   3. The values - the data itself

print(data.index)     # An Index object that allows for fast searching
print(data.columns)   # An Index object that allows for fast searching
print(data.values)    # Note: What is the type of the 'values'? A numpy array!
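
The same three components can be seen on a tiny hand-built DataFrame. This is only a sketch: the `toy` frame and its values are made up for illustration and are not taken from the Iowa dataset.

```python
import pandas as pd

# Hypothetical toy data (not from the Iowa dataset)
toy = pd.DataFrame(
    {"LotArea": [8000, 9500, 11000], "SalePrice": [200000, 180000, 220000]},
    index=[1, 2, 3],
)

print(list(toy.index))    # the row labels
print(list(toy.columns))  # the column labels
print(toy.values.shape)   # the underlying 2-D NumPy array: (rows, columns)
```

Passing `index=` explicitly replaces the default 0, 1, 2, ... row labels that pandas would otherwise generate.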

In [None]:
# We can ask the DataFrame to 'describe' itself to get a general idea about the column ranges
data.describe()

In [None]:
# The 'head' method gives us the first 5 rows of the DataFrame
# We could also obtain the first 5 rows by indexing into the DataFrame using iloc
# Remember this slicing syntax from before?
data.iloc[0:5, :]

In [None]:
# Looks like there is already an 'Id' column in our data
# This is a more 'natural' index into the data than the one that was
# created when we imported the data.
# So let's set the index of our data to this column
data = data.set_index('Id')
data.head()
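
Once a column such as `Id` becomes the index, rows can be looked up either by label with `loc` or by position with `iloc`. The sketch below uses a made-up `toy` frame (hypothetical values) to show that the two can point at the same cell through different routes.

```python
import pandas as pd

# Hypothetical toy data with an 'Id' column to use as the index
toy = pd.DataFrame({"Id": [101, 102, 103],
                    "SalePrice": [200000, 180000, 220000]})
toy = toy.set_index("Id")

# Label-based lookup: the row whose index label is 102
by_label = toy.loc[102, "SalePrice"]

# Position-based lookup: the second row, first column
by_position = toy.iloc[1, 0]

print(by_label, by_position)
```

Note that `set_index` returns a new DataFrame, which is why the result is assigned back to `toy`.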

In [None]:
# Getting a subset of the columns of the DataFrame
data[['LotArea', 'GrLivArea', 'SalePrice']]

# Categorical vs Continuous
Data can be categorized into two broad types. Data that takes one of a limited set of discrete values is called **categorical**. These variables usually have strings as values, but sometimes numeric values like year or age may be treated as categorical. **Continuous** variables, on the other hand, are always numeric and can take any value within a range.
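
As a rough first pass, the column dtypes and the number of distinct values hint at which columns are categorical. This is only a heuristic sketch on made-up data (the `toy` frame is hypothetical): a numeric column such as a year can still be categorical, so the final call needs judgment.

```python
import pandas as pd

# Hypothetical toy data: one string column, one floating-point column
toy = pd.DataFrame({
    "Neighborhood": ["A", "B", "A", "C"],
    "GrLivArea": [1500.0, 1720.5, 980.0, 2100.25],
})

print(toy.dtypes)     # data type of each column
print(toy.nunique())  # number of distinct values per column

# Split columns by dtype: numeric vs. everything else
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
non_numeric_cols = toy.select_dtypes(exclude="number").columns.tolist()
print(numeric_cols, non_numeric_cols)
```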

## Selecting Single Columns of Data - A Series
Each column of data may be selected with the indexing operator [] and a passed string name. A pandas **Series** is a single dimensional data structure with an index and values. It has no columns.

In [None]:
# Grab single columns
sale_price = data['SalePrice']
print(type(sale_price))
sale_price.head(10)

In [None]:
# Let's sort the Series object by its values so we can get an idea of the most expensive homes
sale_price = sale_price.sort_values(ascending=False)
sale_price.head(10)

In [None]:
# We can describe a Series object just like we described the DataFrame
sale_price.describe()

## Plotting directly from pandas
DataFrames and Series conveniently provide a `plot` method for plotting without calling matplotlib directly (pandas uses matplotlib under the hood by default).

In [None]:
# Plot a histogram of SalePrice data
data['SalePrice'].plot(kind='hist')

In [None]:
# Plot a scatter plot of GrLivArea (above-grade living area) vs. the sale price
data.plot(kind='scatter', x='GrLivArea', y='SalePrice')

### Counting the values of categorical data
The **`value_counts`** method is valuable for getting an idea of the distribution of categorical variables.

In [None]:
neighborhoods = data['Neighborhood']
neighborhoods.value_counts().head(10)
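
`value_counts` can also report proportions instead of raw counts through its `normalize` parameter. A minimal sketch on a made-up Series (the labels are hypothetical):

```python
import pandas as pd

# Hypothetical neighborhood labels
toy_neighborhoods = pd.Series(["A", "B", "A", "C", "A", "B"])

# Raw counts, sorted from most to least frequent
counts = toy_neighborhoods.value_counts()

# Fractions of the total instead of counts
shares = toy_neighborhoods.value_counts(normalize=True)

print(counts)
print(shares)
```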

## Filtering Data
We can use the indexing operator [] and *Boolean Indexing* to zero in on the data that we want.

In [None]:
# Observe what we get back when we do a comparison like this:
is_affordable = data['SalePrice'] < 300000
print(is_affordable.head())
# We can use this Boolean Series (with an index identical to that of the DataFrame) to index into the DataFrame
# This is called Boolean Indexing, and works much like Boolean Indexing in NumPy
affordable_data = data[is_affordable]
affordable_data.shape

In [None]:
# Let's say we're only interested in a subset of the data, both in terms of rows and columns

# We're only interested in cheap homes built in or after 2005
data_subset = data[(data['SalePrice'] < 300000) & (data['YearBuilt'] >= 2005)]
# We're only interested in the Neighborhood/Living Area/Sale Price columns
data_subset = data_subset[['Neighborhood', 'GrLivArea', 'SalePrice']]
data_subset.shape
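
When combining conditions, each comparison needs its own parentheses because `&` binds more tightly than `<` or `>=` in Python. The sketch below, on a made-up `toy` frame (hypothetical values), also shows `loc` selecting rows and columns in a single step as one possible alternative to two separate selections.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "SalePrice": [100000, 350000, 250000],
    "YearBuilt": [1990, 2010, 2007],
})

# Element-wise AND of two Boolean Series; parentheses are required
mask = (toy["SalePrice"] < 300000) & (toy["YearBuilt"] >= 2005)
subset = toy[mask]

# Rows and columns in one step: Boolean mask for rows, list of names for columns
prices_only = toy.loc[mask, ["SalePrice"]]

print(mask.tolist())
print(subset)
```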

In [None]:
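# Let's find out the mean of all the numeric columns in our restricted dataset
# Note the 'axis' parameter here - axis=0 averages down the rows, giving one mean per column
# numeric_only=True skips the string-valued 'Neighborhood' column
print(data_subset.mean(axis=0, numeric_only=True))

# What would the following give us?
# print(data_subset.mean(axis=1, numeric_only=True))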
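
The effect of the `axis` parameter is easiest to see on a frame small enough to check by hand. A minimal sketch with made-up numbers:

```python
import pandas as pd

# Hypothetical 2x2 frame
toy = pd.DataFrame({"a": [1.0, 3.0], "b": [2.0, 6.0]}, index=["r1", "r2"])

# axis=0 collapses the rows: one mean per column
col_means = toy.mean(axis=0)

# axis=1 collapses the columns: one mean per row
row_means = toy.mean(axis=1)

print(col_means)
print(row_means)
```

The result of `axis=0` is indexed by the column names, while the result of `axis=1` is indexed by the row labels.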

## Grouping and Aggregating
One of the most common operations during an analysis is to divide the data into groups and aggregate some other dimension of data. For instance, to calculate the average sale price by neighborhood we can do the following.

In [None]:
mean_neighborhood_prices = data_subset.groupby('Neighborhood').mean()
print(mean_neighborhood_prices)
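
To aggregate a single column, or to compute several statistics at once, a column can be selected after `groupby` and passed to `agg`. A sketch on made-up data (the `toy` frame is hypothetical):

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "Neighborhood": ["A", "A", "B", "B", "B"],
    "SalePrice": [100000, 200000, 150000, 250000, 200000],
})

# Mean sale price per neighborhood (a Series indexed by Neighborhood)
mean_price = toy.groupby("Neighborhood")["SalePrice"].mean()

# Several aggregations at once (a DataFrame with one column per statistic)
summary = toy.groupby("Neighborhood")["SalePrice"].agg(["mean", "count"])

print(mean_price)
print(summary)
```

Reporting the count alongside the mean helps flag groups whose average rests on only a handful of rows.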

## Sorting Data
Both DataFrame and Series objects can be sorted by using the *sort_values* method, which returns a new DataFrame or Series object.

In [None]:
# Let's get the most expensive neighborhoods corresponding to our search criteria.
print(mean_neighborhood_prices.sort_values(by=['SalePrice'], ascending=False))
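
Because `sort_values` returns a new object, the original keeps its order unless the result is assigned back. A small sketch with a made-up Series:

```python
import pandas as pd

# Hypothetical mean prices keyed by neighborhood label
prices = pd.Series([150000, 300000, 200000], index=["A", "B", "C"])

# Returns a new, sorted Series; 'prices' itself is left untouched
sorted_prices = prices.sort_values(ascending=False)

print(list(sorted_prices.index))
print(list(prices.index))
```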