# Pandas - Example

In [None]:
import pandas as pd
import numpy as np

## Loading and inspecting data
- `df = pd.read_csv('data/pandas_example.csv')`
- `.shape`
- `.head()` / `.tail()`
- `.info()`
- `.describe()`
- `.sort_values('units', ascending=False)` 
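The inspection calls above can be tried without the real file. The following is a minimal sketch that reads a small inline CSV instead of `data/pandas_example.csv`; the rows and the `neighborhood` column values are made up for illustration, and the real file's columns may differ.

```python
import io
import pandas as pd

# Hypothetical stand-in for data/pandas_example.csv (values are invented).
csv_text = """id,borough,neighborhood,units
1,1,CHELSEA,12
2,3,PARK SLOPE,850
3,4,ASTORIA,3
"""
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)        # (number of rows, number of columns)
print(df.head(2))      # first two rows
print(df.tail(1))      # last row
df.info()              # column dtypes and non-null counts
print(df.describe())   # summary statistics for the numeric columns

# sort_values returns a new DataFrame; df itself is left unchanged.
by_units = df.sort_values('units', ascending=False)
print(by_units)
```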

# Extracting subsets
- Use `iloc[]` to select by integer position.
    - Try with single values and with ranges (`[n:m]`).
    - Try in two dimensions (rows and columns).
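As a rough illustration of positional indexing, the sketch below applies `iloc[]` to a small made-up DataFrame (the values are hypothetical, not taken from the dataset).

```python
import pandas as pd

# Hypothetical sample data for illustration only.
df = pd.DataFrame({
    'borough': [1, 3, 4, 2],
    'units': [12, 850, 3, 40],
    'sale_price': [500000, 9000000, 300000, 750000],
})

row = df.iloc[0]            # single position: one row, returned as a Series
rows = df.iloc[1:3]         # range: rows at positions 1 and 2 (end is exclusive)
cell = df.iloc[2, 1]        # two dimensions: row 2, column 1, returned as a scalar
block = df.iloc[0:2, 0:2]   # first two rows and first two columns

print(row, rows, cell, block, sep='\n\n')
```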

## Filter data
Select rows that meet certain criteria.

- Find all houses with more than 800 units.
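One common way to do this is boolean indexing: build a True/False mask and pass it to the DataFrame. A minimal sketch on made-up data:

```python
import pandas as pd

# Hypothetical sample data for illustration only.
df = pd.DataFrame({
    'borough': [1, 3, 4, 2],
    'units': [12, 850, 3, 40],
})

mask = df['units'] > 800   # boolean Series, one entry per row
large = df[mask]           # keeps only the rows where the mask is True
print(large)
```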

# Select columns
- Each column is a pandas `Series`.
- Select a single column.
- Select a set of columns.
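A short sketch of both selections, again on invented data. Single brackets with one name give a `Series`; a list of names gives a DataFrame.

```python
import pandas as pd

# Hypothetical sample data for illustration only.
df = pd.DataFrame({
    'borough': [1, 3, 4],
    'units': [12, 850, 3],
    'sale_price': [500000, 9000000, 300000],
})

units = df['units']                # one column: a Series
subset = df[['borough', 'units']]  # list of names: a DataFrame

print(type(units).__name__)    # Series
print(type(subset).__name__)   # DataFrame
```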

## Remove columns
We won't use the `id` and `easement` columns. Let's drop them.
- Use `drop(columns=['id', 'easement'])`.
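A minimal sketch of the drop on made-up data. Storing the result under the name `df_homes` is an assumption here; `drop()` returns a new DataFrame and leaves the original untouched unless the result is assigned.

```python
import pandas as pd

# Hypothetical sample data for illustration only.
df = pd.DataFrame({
    'id': [1, 2],
    'easement': [None, None],
    'borough': [1, 3],
    'units': [12, 850],
})

# drop() returns a new DataFrame without the listed columns.
df_homes = df.drop(columns=['id', 'easement'])
print(df_homes.columns.tolist())
```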

# Rename a column
- `df_homes.rename(columns = {'borough':'district'})`
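Note that `rename()` also returns a new DataFrame, so the result has to be assigned to take effect. A small sketch on invented data:

```python
import pandas as pd

# Hypothetical sample data for illustration only.
df_homes = pd.DataFrame({'borough': [1, 3], 'sale_price': [500000, 300000]})

renamed = df_homes.rename(columns={'borough': 'district'})

print(df_homes.columns.tolist())  # original still has 'borough'
print(renamed.columns.tolist())   # copy has 'district'
```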

# Convert data types
- Check `df_homes.info()`. Note that `gross_sqft`, `land_sqft` and `sale_price` are not stored as numeric types. This is because they contain hardcoded no-data placeholders that `read_csv()` could not interpret as numbers. Let's convert those columns to numeric values using `pd.to_numeric()`.
- Check `df_homes.info()` again.
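The conversion might look like the sketch below. The `'-'` placeholder is an assumption standing in for whatever no-data marker the real file uses, and `errors='coerce'` is one possible choice: it turns anything unparseable into `NaN` instead of raising an error.

```python
import pandas as pd

# Hypothetical sample data; '-' stands in for the file's actual no-data marker.
df_homes = pd.DataFrame({
    'gross_sqft': ['1200', '-', '3400'],
    'land_sqft': ['2000', '1500', '-'],
    'sale_price': ['500000', '-', '750000'],
})

for col in ['gross_sqft', 'land_sqft', 'sale_price']:
    # Unparseable entries become NaN rather than raising a ValueError.
    df_homes[col] = pd.to_numeric(df_homes[col], errors='coerce')

df_homes.info()
```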

# Data Cleaning
## Dealing with `null` values
- use `isnull()`
- combine it with the aggregate function `sum()`
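Combining the two calls gives a per-column count of missing values. A minimal sketch on made-up data:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data for illustration only.
df_homes = pd.DataFrame({
    'gross_sqft': [1200.0, np.nan, 3400.0],
    'sale_price': [np.nan, np.nan, 750000.0],
})

# isnull() gives a True/False frame; sum() counts the True values per column.
null_counts = df_homes.isnull().sum()
print(null_counts)
```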

## Replace NULL values
- We could replace all NA values with a fixed value: `df_homes.fillna(500)`
- Alternatively, we can simply remove all rows that contain an NA value using `dropna()`
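The two options can be compared side by side. This sketch uses invented data; both methods return new DataFrames.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data for illustration only.
df_homes = pd.DataFrame({
    'gross_sqft': [1200.0, np.nan, 3400.0],
    'sale_price': [np.nan, 600000.0, 750000.0],
})

filled = df_homes.fillna(500)   # every NA becomes 500
dropped = df_homes.dropna()     # rows with at least one NA are removed

print(filled)
print(dropped)
```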

## Dealing with duplicates
- Check to see if there are any duplicate records using `duplicated()`
    - Note that we could specify the columns 
- Drop the duplicates using `drop_duplicates()`. Use the `keep` keyword argument to choose which copy is retained.
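A short sketch of the duplicate checks on made-up data. `subset` restricts the comparison to the listed columns, and `keep='first'` retains the first occurrence of each duplicated row.

```python
import pandas as pd

# Hypothetical sample data for illustration only.
df_homes = pd.DataFrame({
    'neighborhood': ['CHELSEA', 'CHELSEA', 'ASTORIA', 'CHELSEA'],
    'sale_price': [500000, 500000, 300000, 650000],
})

dup_mask = df_homes.duplicated()                            # compares whole rows
dup_by_col = df_homes.duplicated(subset=['neighborhood'])   # compares one column only
deduped = df_homes.drop_duplicates(keep='first')            # keep the first copy

print(dup_mask.sum(), dup_by_col.sum(), len(deduped))
```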

## Removing data
- Remove houses with a sale price of 0.
- Remove houses with a year built of 0.
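Both removals can be expressed as one boolean filter. In this sketch the column name `year_built` is an assumption (the outline does not name the column), and the data is invented.

```python
import pandas as pd

# Hypothetical sample data; 'year_built' is an assumed column name.
df_homes = pd.DataFrame({
    'sale_price': [500000, 0, 750000, 300000],
    'year_built': [1920, 1950, 0, 2005],
})

# Keep only rows where both values are non-zero.
keep = (df_homes['sale_price'] != 0) & (df_homes['year_built'] != 0)
df_homes = df_homes[keep]
print(df_homes)
```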

## Adding columns
- Compute the building age and add it as a new column
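A possible sketch: subtract the construction year from a reference year and assign the result to a new column. The column names `year_built` and `building_age` and the fixed `reference_year` are assumptions for illustration; `pd.Timestamp.now().year` could be used instead to get the current year.

```python
import pandas as pd

# Hypothetical sample data; column names are assumed.
df_homes = pd.DataFrame({'year_built': [1920, 2005]})

reference_year = 2024  # assumed fixed reference year for a reproducible result

# Assigning to a new key adds the column to the DataFrame.
df_homes['building_age'] = reference_year - df_homes['year_built']
print(df_homes)
```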

## Remove outliers
- Remove properties that have SQFT greater than 10,000.
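The outline does not say which square-footage column is meant; this sketch assumes `gross_sqft` and uses invented values. Rows with exactly 10,000 are kept because only values greater than the threshold are removed.

```python
import pandas as pd

# Hypothetical sample data; filtering on 'gross_sqft' is an assumption.
df_homes = pd.DataFrame({'gross_sqft': [1200, 25000, 3400, 10000]})

# Keep rows at or below the threshold, which drops everything above it.
df_homes = df_homes[df_homes['gross_sqft'] <= 10000]
print(df_homes)
```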


# Replace the district/borough numbers with their actual names
- https://en.wikipedia.org/wiki/Boroughs_of_New_York_City
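One way to do the replacement is to `map()` the codes through a dictionary. The sketch below assumes the column has already been renamed to `district` and that the dataset uses the usual NYC coding (1 Manhattan, 2 Bronx, 3 Brooklyn, 4 Queens, 5 Staten Island); check this against the actual data before relying on it.

```python
import pandas as pd

# Hypothetical sample data; the code-to-name mapping is assumed to match the dataset.
df_homes = pd.DataFrame({'district': [1, 3, 5, 2]})

borough_names = {
    1: 'Manhattan',
    2: 'Bronx',
    3: 'Brooklyn',
    4: 'Queens',
    5: 'Staten Island',
}

# map() looks up each value in the dict; codes missing from the dict become NaN.
df_homes['district'] = df_homes['district'].map(borough_names)
print(df_homes)
```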

# Data analysis
- Count frequencies in a categorical column: `df_homes['neighborhood'].value_counts()`
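A small sketch of `value_counts()` on invented data. The result is a `Series` indexed by category and sorted from most to least frequent.

```python
import pandas as pd

# Hypothetical sample data for illustration only.
df_homes = pd.DataFrame({
    'neighborhood': ['CHELSEA', 'ASTORIA', 'CHELSEA', 'PARK SLOPE', 'CHELSEA'],
})

counts = df_homes['neighborhood'].value_counts()
print(counts)
```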

# Sample some homes
- Randomly select two rows.
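Random rows can be drawn with `sample()`. In this sketch the data is made up, and `random_state` is set only so that the same rows are drawn on every run.

```python
import pandas as pd

# Hypothetical sample data for illustration only.
df_homes = pd.DataFrame({
    'neighborhood': ['CHELSEA', 'ASTORIA', 'PARK SLOPE', 'HARLEM'],
    'sale_price': [500000, 300000, 750000, 420000],
})

# n is the number of rows to draw; random_state makes the draw repeatable.
picked = df_homes.sample(n=2, random_state=0)
print(picked)
```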

# Grouping
- Find the average sale price for each district using `groupby()`.
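A minimal sketch of the grouping on invented data: group by `district`, pick the `sale_price` column, and aggregate with `mean()`.

```python
import pandas as pd

# Hypothetical sample data for illustration only.
df_homes = pd.DataFrame({
    'district': ['Manhattan', 'Brooklyn', 'Manhattan', 'Brooklyn'],
    'sale_price': [500000, 300000, 700000, 500000],
})

# One mean per district, returned as a Series indexed by district.
avg_price = df_homes.groupby('district')['sale_price'].mean()
print(avg_price)
```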

# Pivot 
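The outline leaves this section open. As one possible exercise, `pivot_table()` can summarize sale prices with one variable on the rows and another on the columns. The `building_type` column below is hypothetical and only serves to illustrate the call.

```python
import pandas as pd

# Hypothetical sample data; 'building_type' is an invented column.
df_homes = pd.DataFrame({
    'district': ['Manhattan', 'Manhattan', 'Brooklyn', 'Brooklyn'],
    'building_type': ['condo', 'house', 'condo', 'condo'],
    'sale_price': [500000, 900000, 300000, 500000],
})

# Rows: district, columns: building type, cells: mean sale price.
pivot = df_homes.pivot_table(
    index='district',
    columns='building_type',
    values='sale_price',
    aggfunc='mean',
)
print(pivot)  # combinations with no data show up as NaN
```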

# Visualize
- Plot `gross_sqft` against `land_sqft`.
- Make a histogram of `land_sqft`.
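Both plots can be drawn with the pandas plotting interface, which relies on matplotlib being installed. The sketch below uses invented data and places the two plots side by side; the bin count of 5 is an arbitrary choice.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sample data for illustration only.
df_homes = pd.DataFrame({
    'land_sqft': [1500, 2000, 2500, 4000, 1800, 3000],
    'gross_sqft': [1200, 1900, 2100, 3600, 1500, 2800],
})

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Scatter plot: gross_sqft on the y-axis against land_sqft on the x-axis.
df_homes.plot.scatter(x='land_sqft', y='gross_sqft', ax=axes[0])

# Histogram of land_sqft.
df_homes['land_sqft'].plot.hist(bins=5, ax=axes[1])
axes[1].set_xlabel('land_sqft')

fig.tight_layout()
```

In a notebook the figure is displayed automatically at the end of the cell; in a script, call `plt.show()` or `fig.savefig(...)`.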