# Introduction

**Environment preparation**

* Create Python environment
```
python -m venv env-name
```
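* Activate the environment (the command below is for Windows; on Linux/macOS use `source env-name/bin/activate`)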
```
env-name\Scripts\Activate
```
* Install Python packages
```
pip install jupyter jupyterlab pandas matplotlib requests seaborn scipy scikit-learn
```
* Run Jupyter lab
```
jupyter lab
```

**Jupyter lab**

* Add code
* Add text
* Execute command
* Shortcuts (a, b, dd, Ctrl+Enter, Shift+Enter, x, c, v)



**Alternatives**

* Google Colab ([Colaboratory](https://colab.research.google.com/))
* Python scripts in VS Code






# Data processing

Data source: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data

Important attributes description:
* SalePrice: The property's sale price in dollars. This is the target variable that you're trying to predict.
* MSSubClass: The building class
* BldgType: Type of dwelling
* HouseStyle: Style of dwelling
* OverallQual: Overall material and finish quality
* OverallCond: Overall condition rating
* YearBuilt: Original construction date
* Heating: Type of heating
* CentralAir: Central air conditioning
* GrLivArea: Above grade (ground) living area square feet
* BedroomAbvGr: Number of bedrooms above basement level

## Pandas tasks:
* Load data
* Standard data inspection (functions head(), tail(), describe(), isna(), shape)
* Select one attribute into a variable
    - Series and numpy compatibility
* dtype, index, columns
* Data selection - [], loc, iloc
* Data filtering and logical operators
* Add new column to dataframe
* Calculate new numerical attribute
* Data selection - comparison and negation
* Assign new values to selected rows from dataframe
* Use .apply() for rows and single column
* Use .groupby() for data aggregation

## Import used packages

In [None]:
import pandas as pd # dataframes
import numpy as np # matrices and linear algebra
import matplotlib.pyplot as plt # plotting
import seaborn as sns # another matplotlib interface - styled and easier to use

## The first step is to load the data into the Pandas DataFrame - in our case it is a csv file
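A minimal sketch of loading CSV data with `pd.read_csv`. To keep the cell self-contained, it reads a hypothetical three-row excerpt from an in-memory string; with the downloaded Kaggle file on disk (assumed here to be named `train.csv`), the path would be passed instead.

```python
import io
import pandas as pd

# Hypothetical three-row excerpt standing in for the Kaggle CSV file.
csv_text = """Id,MSSubClass,YearBuilt,Heating,SalePrice
1,60,2003,GasA,208500
2,20,1976,GasA,181500
3,60,2001,GasA,223500
"""

# With the real file this would be: df = pd.read_csv("train.csv")
df = pd.read_csv(io.StringIO(csv_text), sep=",")
print(df)
```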

## We should take a look at the data after loading to check that everything is OK

### We will start by showing the first/last N rows
- There are several ways of doing that:
    - name of the dataframe
    - head()
    - tail()
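The options listed above can be sketched as follows. The 30-row frame is a hypothetical stand-in for the loaded dataset; in a notebook, evaluating `df` alone as the last expression of a cell displays the frame itself.

```python
import pandas as pd

# Hypothetical stand-in for the loaded data: 30 rows with a single column.
df = pd.DataFrame({"SalePrice": range(100000, 130000, 1000)})

first_five = df.head()      # head() returns 5 rows by default
last_twenty = df.tail(20)   # pass N to change the number of rows
print(first_five)
print(last_twenty.shape)
```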

### Show the first and last 5 rows

### Show first 5 rows

### Show last 20 rows

## If we want to know whether there are any missing values, the isna() method can prove useful
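One common pattern is to chain `isna()` with `sum()` to count missing values per column. The sketch below uses a small hypothetical frame; the values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with some missing values.
df = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, np.nan],
    "Alley": [None, "Grvl", None, None],
    "SalePrice": [208500, 181500, 223500, 140000],
})

# isna() returns a boolean frame; summing counts the True values per column.
missing_per_column = df.isna().sum()
print(missing_per_column)

# True if at least one value anywhere in the frame is missing.
has_missing = df.isna().any().any()
print(has_missing)
```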

## We can show a summary of common statistical characteristics of the data using the describe() function
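A short sketch on a hypothetical three-row frame: by default `describe()` summarizes only numeric columns, while `include="all"` also reports non-numeric ones.

```python
import pandas as pd

# Hypothetical sample with one numeric and one text column.
df = pd.DataFrame({
    "SalePrice": [100000, 200000, 300000],
    "Heating": ["GasA", "GasA", "GasW"],
})

summary = df.describe()            # numeric columns only
print(summary)
print(df.describe(include="all"))  # adds count/unique/top/freq for text columns
```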

## Dataframe has several useful properties
- shape
- dtypes
- columns
- index
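The four properties listed above are attributes, not methods, so they are accessed without parentheses. A sketch on a hypothetical frame:

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "SalePrice": [208500, 181500, 223500],
    "Heating": ["GasA", "GasA", "GasW"],
})

print(df.shape)    # (row count, column count)
print(df.dtypes)   # data type of each column
print(df.columns)  # column labels
print(df.index)    # row labels
```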

#### Row and column count

#### Datatypes of columns

#### Column names

#### Row index values

## We may want to work with just one column not the whole dataframe
- We will select only the SalePrice column and save it to another variable

## A single column is a Pandas Series - it shares a common API with Pandas DataFrame
- Pandas is numpy-backed, so the .values property gives us the Series data as a standard numpy array

## E.g. find the maximum price using Numpy and Pandas
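A minimal sketch of both approaches on hypothetical prices: the Series method `max()` and `np.max` applied to the underlying array should agree.

```python
import numpy as np
import pandas as pd

# Hypothetical sample prices.
df = pd.DataFrame({"SalePrice": [208500, 181500, 223500, 140000]})

price = df["SalePrice"]            # selecting one column yields a Series
max_pandas = price.max()           # pandas Series method
max_numpy = np.max(price.values)   # numpy function on the underlying array
print(max_pandas, max_numpy)
```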

## Data filtering using Pandas DataFrame
- There are several ways of filtering the data (similar logic to .Where() in C# or WHERE in SQL)
- We usually work with two indexers - .loc[] and .iloc[]

### The .iloc[] indexer works with integer positions - very close to the way of working with raw arrays
### The .loc[] indexer works with row and column labels and with boolean (logical) expressions
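The difference between the two indexers can be sketched on a small hypothetical frame. One detail worth noting is that `.iloc[]` slices exclude the end position, while `.loc[]` slices include the end label.

```python
import pandas as pd

# Hypothetical sample data with the default integer index 0..5.
df = pd.DataFrame({
    "SalePrice": [208500, 181500, 223500, 140000, 250000, 143000],
    "YearBuilt": [2003, 1976, 2001, 1915, 2000, 1993],
    "Heating": ["GasA", "GasA", "GasW", "GasA", "Grav", "GasA"],
    "CentralAir": ["Y", "Y", "Y", "N", "Y", "N"],
})

third_col = df.iloc[:, 2]     # all rows, third column (position 2)
last_col = df.iloc[:, -1]     # all rows, last column
pos_slice = df.iloc[1:3]      # positions 1 and 2 (end excluded)
label_slice = df.loc[1:3]     # labels 1, 2 and 3 (end included)
subset = df.loc[df["YearBuilt"] >= 2000, ["SalePrice", "Heating"]]
print(subset)
```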

### Select all rows and 3rd column of dataframe

### Select all rows and LAST column of dataframe

### Select rows 15 to 22 and all columns

### Select rows 15 to 22 and 3rd column

## Select only a subset of columns to a new dataframe
- 'SalePrice','MSSubClass','BldgType','HouseStyle','OverallQual','OverallCond','YearBuilt','Heating','CentralAir','GrLivArea','BedroomAbvGr'

### Select only houses built in year 2000 or later

### Select only houses that don't use GasA for heating (try != and ~ operators)

### Select houses cheaper than 180k USD and with at least 2 bedrooms

### Select houses with 2 stories or air conditioning
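The filtering tasks above all follow the same pattern: build a boolean mask and pass it to `[]` (or `.loc[]`). A sketch on hypothetical rows, assuming the category codes `GasA`, `2Story` and `Y` as used in the dataset description; each condition needs its own parentheses when combined with `&` (and), `|` (or) or `~` (not).

```python
import pandas as pd

# Hypothetical sample rows.
df = pd.DataFrame({
    "SalePrice": [208500, 181500, 223500, 140000, 250000, 143000],
    "YearBuilt": [2003, 1976, 2001, 1915, 2000, 1993],
    "Heating": ["GasA", "GasA", "GasW", "GasA", "Grav", "GasA"],
    "CentralAir": ["Y", "Y", "Y", "N", "Y", "N"],
    "HouseStyle": ["2Story", "1Story", "2Story", "2Story", "1Story", "1.5Fin"],
    "BedroomAbvGr": [3, 3, 3, 3, 4, 1],
})

built_2000_plus = df[df["YearBuilt"] >= 2000]
not_gasa = df[df["Heating"] != "GasA"]
not_gasa_tilde = df[~(df["Heating"] == "GasA")]   # same result via negation
cheap_with_bedrooms = df[(df["SalePrice"] < 180000) & (df["BedroomAbvGr"] >= 2)]
two_story_or_air = df[(df["HouseStyle"] == "2Story") | (df["CentralAir"] == "Y")]
print(len(built_2000_plus), len(not_gasa), len(cheap_with_bedrooms), len(two_story_or_air))
```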

## We can add new columns to the DataFrame as well

### Add a new column named Age for each house (current year - year built)

### Add a new column IsLuxury with True value for houses with more than 3 bedrooms and price above 214k USD (.loc)
- How many luxury houses are in the dataset?
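One possible way to approach both column tasks, sketched on hypothetical rows: a new column is created by assignment, and `.loc[mask, column] = value` overwrites only the rows selected by the mask.

```python
from datetime import date

import pandas as pd

# Hypothetical sample rows.
df = pd.DataFrame({
    "SalePrice": [208500, 181500, 223500, 140000, 250000, 143000],
    "YearBuilt": [2003, 1976, 2001, 1915, 2000, 1993],
    "BedroomAbvGr": [3, 3, 3, 3, 4, 1],
})

# Age relative to the current calendar year.
df["Age"] = date.today().year - df["YearBuilt"]

# Start with False everywhere, then flip the rows matching the mask.
df["IsLuxury"] = False
mask = (df["BedroomAbvGr"] > 3) & (df["SalePrice"] > 214000)
df.loc[mask, "IsLuxury"] = True

luxury_count = df["IsLuxury"].sum()   # True counts as 1
print(luxury_count)
```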

### BONUS: We may apply function to each row of the dataset and compute the value that way as well
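A sketch of both uses of `.apply()` on hypothetical rows: with `axis=1` the function receives one row at a time, while on a single column it receives one value at a time. The helper `is_luxury` and the `PriceK` column are illustrative names, not part of the dataset.

```python
import pandas as pd

# Hypothetical sample rows.
df = pd.DataFrame({
    "SalePrice": [208500, 140000, 250000],
    "BedroomAbvGr": [3, 3, 4],
})

def is_luxury(row):
    # Illustrative rule: more than 3 bedrooms and price above 214k USD.
    return row["BedroomAbvGr"] > 3 and row["SalePrice"] > 214000

df["IsLuxury"] = df.apply(is_luxury, axis=1)               # row-wise
df["PriceK"] = df["SalePrice"].apply(lambda p: p / 1000)   # value-wise
print(df)
```

Row-wise `.apply()` is typically slower than the vectorized mask approach, so it is mostly useful when the logic is hard to express with column operations.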

## Pandas enables us to use aggregation functions for the data using the .groupby() function

### Compute counts for all the heating methods (groupby / value_counts)
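Both approaches named in the heading can be sketched on hypothetical rows. `value_counts()` sorts by frequency, while `groupby().size()` sorts by group key; `groupby` also allows other aggregations, such as the mean price per heating type.

```python
import pandas as pd

# Hypothetical sample rows.
df = pd.DataFrame({
    "Heating": ["GasA", "GasA", "GasW", "GasA", "Grav", "GasA"],
    "SalePrice": [208500, 181500, 223500, 140000, 250000, 143000],
})

counts_vc = df["Heating"].value_counts()               # counts via value_counts
counts_gb = df.groupby("Heating").size()               # counts via groupby
mean_price = df.groupby("Heating")["SalePrice"].mean() # another aggregation
print(counts_vc)
print(mean_price)
```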

# Visualization

## Scatter plot
- Visualize the relationship between SalePrice and GrLivArea. Use the scatter plot from **Matplotlib**.

### Modify figure size and add title

### Add axis labels

### Change one of the axes to a logarithmic scale

### Add colors for data points based on CentralAir value.
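The Matplotlib steps above (figure size, title, axis labels, log scale, colors by category) can be combined as in the following sketch. The data rows, the label texts and the color choices are assumptions made for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sample rows.
df = pd.DataFrame({
    "GrLivArea": [1710, 1262, 1786, 1717, 2198, 1362],
    "SalePrice": [208500, 181500, 223500, 140000, 250000, 143000],
    "CentralAir": ["Y", "Y", "Y", "N", "Y", "N"],
})

# Map each category to a color (arbitrary choice of colors).
colors = df["CentralAir"].map({"Y": "tab:blue", "N": "tab:red"}).tolist()

fig, ax = plt.subplots(figsize=(9, 6))             # figure size in inches
ax.scatter(df["GrLivArea"], df["SalePrice"], c=colors)
ax.set_title("Sale price vs. living area")
ax.set_xlabel("GrLivArea (sq ft)")
ax.set_ylabel("SalePrice (USD)")
ax.set_yscale("log")                               # logarithmic y axis
```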

### Try to use scatterplot from **Seaborn** library for scatter plot visualization.

#### Use series data for axes x and y

### Use dataframe as source and column names for axes data

### Resize plot and add color for markers based on CentralAir column

### Set color palette to *binary* scheme

## Line plot
- Calculate and visualize the average house price in relation to YearBuilt.

### Create a dataframe from the previous calculation and visualize it using the Seaborn line plot (note: use .reset_index())
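A sketch of the aggregation step on hypothetical rows: `groupby(...).mean()` returns a Series indexed by year, and `.reset_index()` turns it into a dataframe with two regular columns, which is the shape Seaborn expects. The plotting call is left as a comment so the cell depends on pandas only.

```python
import pandas as pd

# Hypothetical sample rows.
df = pd.DataFrame({
    "YearBuilt": [2000, 2000, 2003, 1976],
    "SalePrice": [250000, 150000, 208500, 181500],
})

# Average price per construction year, converted back to a dataframe.
avg_price = df.groupby("YearBuilt")["SalePrice"].mean().reset_index()
print(avg_price)

# Possible plotting call, assuming seaborn is imported as sns:
# sns.lineplot(data=avg_price, x="YearBuilt", y="SalePrice")
```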

## Bar plot
- Calculate and visualize how many houses have CentralAir
- Use a bar plot for the visualization

### Visualize the number of houses of each building type, split by whether they have air conditioning, using Seaborn

# Tasks
## Pandas (1 pt)
Add a new column *Undervalued* which is set to True if the house is priced below 163k USD and has both OverallQual and OverallCond higher than 5.

How many undervalued houses are in the dataset?

## Visualization (1 pt)
Add a new attribute to the dataframe indicating whether the house was built before or after the year 2000.

Create a bar chart of the number of houses by type of dwelling (attribute BldgType, used as the category axis), with the added binary house-age attribute used as the bar color.