In [None]:
import numpy as np
import pandas as pd

# Intro to Pandas

# 1. Load data

We use the house sale price dataset, which can be obtained from Kaggle: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/overview/description

With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, the competition challenges you to predict the final price of each home.

## Tasks:
1.1. Load the `train.csv` file using the `pd.read_csv()` function.

1.2. Print the first 10 and last 10 observations in the table using `.head()` and `.tail()`.

1.3. Print all the column names using the `.columns` attribute.

1.4. Print the number of rows and columns using the `.shape` attribute.

1.5. You may also want to increase the maximum number of columns pandas displays: set `pd.options.display.max_columns` to 30.
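
The calls named in the tasks above can be tried on a tiny in-memory table first. This is only a sketch: the three rows and their values are made up for illustration (they are not taken from `train.csv`), and `toy` is a hypothetical variable name.

```python
import io

import pandas as pd

# Made-up stand-in for train.csv: three rows, four of the dataset's column names.
csv_text = """Id,LotArea,YearBuilt,SalePrice
1,8000,1965,150000
2,9500,1999,200000
3,12000,2005,310000
"""

# pd.read_csv accepts a file path or any file-like object.
toy = pd.read_csv(io.StringIO(csv_text))

# Show up to 30 columns when a DataFrame is printed.
pd.options.display.max_columns = 30

print(toy.head(2))         # first 2 rows
print(toy.tail(2))         # last 2 rows
print(list(toy.columns))   # column names
print(toy.shape)           # (number of rows, number of columns)
```

With the real file, the same calls apply after replacing the `io.StringIO(...)` argument with the path to `train.csv`.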

# 2. Data exploration

## Tasks:
2.1. Use pandas `.describe()` to display basic statistics about the data.

2.2. Use the methods `.min()`, `.max()`, `.mean()`, `.std()` to display specific statistics about the data.

2.3. Count the number of unique values in every column with `.nunique()`. What does this tell you about the features: which are most likely categorical and which are most likely numerical?

2.4. Use the method `.count()` to count the number of non-NA cells in each column. Are there any missing values in the data?
Missing values can be imputed with the `.fillna()` method, using a mean value, a dummy value, or some other logic depending on the feature.

2.5. Use the `.dtypes` attribute to display the data type of each column. Which columns have dtype int64?

2.6. Use the method `.value_counts()` to count how many times each unique value occurs in a specific column.
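
To make the missing-value and counting methods above concrete, here is a minimal sketch on a made-up four-row table. The values are invented, and the fill strategies (column mean for the numerical feature, the dummy label `"Missing"` for the categorical one) are assumptions chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Made-up data: one numerical column and one categorical column with gaps.
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 80.0, 68.0],
    "Alley": ["Grvl", None, None, "Pave"],
    "YrSold": [2008, 2007, 2008, 2006],
})

non_na = toy.count()                         # non-NA cells per column
n_missing = len(toy) - non_na                # missing cells per column
n_unique = toy.nunique()                     # distinct non-NA values per column
sold_counts = toy["YrSold"].value_counts()   # number of rows for each distinct value

# Impute on a copy so the original table stays untouched.
filled = toy.copy()
filled["LotFrontage"] = filled["LotFrontage"].fillna(filled["LotFrontage"].mean())
filled["Alley"] = filled["Alley"].fillna("Missing")

print(n_missing)
print(filled)
```

Whether a mean, a dummy value, or something else is appropriate depends on what a missing entry means for that feature.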

# 3. Data selection

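In a `pandas.DataFrame` you can select:

1. Rows by position (an integer in \[0 .. number of rows - 1\]) using `.iloc`, or by index label (`DataFrame.index`) using `.loc`. Note that a label slice with `.loc` includes both endpoints, while a positional slice with `.iloc` excludes the end: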
```
data.loc[0]
data.loc[5:10]
data.iloc[0]
data.iloc[5:10]
```

2. Columns by name
```
data[columname]
```
3. Rows and columns
```
data.loc[10, columname]
data.iloc[10, data.columns.get_loc(columname)]  # .iloc needs an integer column position
```
4. Rows using a boolean mask
```
data[data[columname] > value]
```
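You can combine multiple conditions using `&` (and) or `|` (or). If you write the comparisons inline, wrap each one in parentheses, because `&` and `|` bind more tightly than `>` or `<`.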

```
cond1 = data[columname1] > value1
cond2 = data[columname2] > value2
data[cond1 & cond2]
```
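5. Rows using queries with `.query()`. Column names are written bare inside the query string, and local Python variables are referenced with an `@` prefix:
```
value = 5
data.query("columname > @value")
```
You can combine multiple conditions using `and`, `or`:

```
data.query("(columname1 > @value1) and (columname2 > @value2)")
```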
and others. See https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html for more examples.
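
The selection styles listed above can be compared side by side on a small table. This is a sketch on made-up values; the index labels 10..13 are chosen on purpose so that labels and positions differ, and `min_year` and `min_area` are hypothetical thresholds.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "YearBuilt": [1965, 1999, 2005, 1950],
        "LotArea": [8000, 9500, 12000, 7000],
        "SalePrice": [150000, 200000, 310000, 120000],
    },
    index=[10, 11, 12, 13],  # labels deliberately differ from positions 0..3
)

by_label = toy.loc[10]        # row whose index label is 10
by_position = toy.iloc[0]     # first row (the same row in this table)

label_slice = toy.loc[10:12]      # label slice, both ends included: 3 rows
position_slice = toy.iloc[0:2]    # positional slice, end excluded: 2 rows

cell = toy.loc[12, "SalePrice"]                             # label and column name
same_cell = toy.iloc[2, toy.columns.get_loc("SalePrice")]   # row and column positions

# Boolean mask: each comparison sits in its own parentheses.
mask = (toy["YearBuilt"] > 1960) & (toy["LotArea"] > 9000)
masked = toy[mask]

# Equivalent query: bare column names, "@" for local variables.
min_year, min_area = 1960, 9000
queried = toy.query("(YearBuilt > @min_year) and (LotArea > @min_area)")

print(masked)
```

Here `masked` and `queried` hold the same two rows; which style to use is mostly a matter of readability.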


## Tasks:

3.1. How many bedrooms does the house in row 7 have?

3.2. How many houses have 3 kitchens?

3.3. What is the percentage of houses built earlier than 1970?

3.4. When was the most expensive house built?

3.5. What roof style does the house built in 2005 with central air conditioning and a lot size of 11911 sqft have?

3.6. What is the median lot size in the most popular zone?

# 4. Groupby
From the documentation (https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html):

By “group by” we are referring to a process involving one or more of the following steps:

    - Splitting the data into groups based on some criteria.
    - Applying a function to each group independently.
    - Combining the results into a data structure.
    
---
`.groupby()` is one of the most powerful tools for feature engineering. Very often it is used to group objects that share the same categorical characteristics and to compute some statistic (e.g. mean, max) of their numerical characteristics.
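
The split, apply, combine steps can be seen on a made-up five-row table. This is a sketch only: the prices are invented round numbers, and `mean_price` and `price_stats` are hypothetical names.

```python
import pandas as pd

toy = pd.DataFrame({
    "RoofStyle": ["Gable", "Hip", "Gable", "Hip", "Gable"],
    "SalePrice": [100, 300, 200, 500, 300],
})

# Split by RoofStyle, apply mean to SalePrice in each group,
# combine into a Series indexed by RoofStyle.
mean_price = toy.groupby("RoofStyle")["SalePrice"].mean()

# Several statistics at once: a DataFrame with one column per statistic.
price_stats = toy.groupby("RoofStyle")["SalePrice"].agg(["min", "max"])

print(mean_price)
print(price_stats)
```

The result has one row per group, not one row per house; the group labels become the index.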

## Tasks
4.1. Compute mean remodel date (`YearRemodAdd`) for each overall condition (`OverallCond`)

4.2. Compute min and max price for each date (MM.YYYY)

4.3. Create a new feature `StyleArea` in `df_train` containing the minimum above-ground living area (`GrLivArea`) within the group of houses that share the same `RoofStyle`, `Foundation`, `Heating`, and `GarageType`.
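
When a group statistic has to become a per-row feature, one option is `.transform()`, which returns a result aligned with the original rows instead of one row per group. The sketch below uses made-up data, only two grouping columns, and a hypothetical column name `GroupMinArea`; it illustrates the mechanism rather than a full solution.

```python
import pandas as pd

toy = pd.DataFrame({
    "RoofStyle": ["Gable", "Gable", "Hip", "Hip", "Gable"],
    "Foundation": ["PConc", "PConc", "PConc", "CBlock", "PConc"],
    "GrLivArea": [1500, 1200, 2000, 900, 1800],
})

# Compute the minimum per (RoofStyle, Foundation) group and broadcast it
# back to every row of that group, keeping the original row order.
toy["GroupMinArea"] = (
    toy.groupby(["RoofStyle", "Foundation"])["GrLivArea"].transform("min")
)

print(toy)
```

One thing to check on real data: with the default settings, rows whose grouping key is missing are not assigned to any group, so it may be worth filling missing keys before grouping.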

# 5. Data visualisation

In [None]:
import seaborn as sns
sns.set(font_scale=1.2, style="whitegrid", palette='magma')
import matplotlib.pyplot as plt

## Tasks

5.1. Plot the number of missing values per column as a pandas `bar` plot.

5.2. Plot the target variable distribution using `sns.distplot` (deprecated since seaborn 0.11; in newer versions use `sns.histplot` or `sns.displot`).

5.3. Visualise the feature correlation matrix using `sns.heatmap`.

5.4. Use `sns.boxplot` to show sale price variability within each `OverallQual` category.

5.5. Study the relationship between price and the `GrLivArea` feature (above grade (ground) living area in square feet) using a `scatter` plot.

5.6. Use `sns.pairplot` to visualise pairwise relations for 'SalePrice', 'OverallQual', 'GrLivArea', 'GarageCars', 'TotalBsmtSF', 'FullBath', and 'YearBuilt'.

5.7. Use `sns.FacetGrid` to create the following figure <img src="FacetGrid.png" width="600">