# Working with data in Jupyter notebooks

### Predictive modelling with machine learning

#### Lecturer: Vegard H. Larsen

## 1. Introduction 

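### What we will cover:
2. Setting up the environment
3. Reading and Writing Data with Pandas
4. Exploring and Visualizing Data
5. Handling Missing Values and Outliers
6. Encoding Categorical Variables
7. Feature Scaling and Normalization
8. Train-Test Split and Basic Data Pipelines
9. Introduction to PyTorch Tensors
10. Using Generative AI in this Course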

## 2. Setting up the environment

Using a virtual environment or a conda environment is crucial for ensuring reproducibility, consistency, and maintainability in your projects. Without an isolated environment, library installations and updates can affect your system-wide settings or other projects, leading to version conflicts and unpredictable behavior. By creating a dedicated environment for each project, you can precisely control which versions of Python and its packages are used, making it easier to replicate your results, share your work with others, and quickly recover a working setup if something goes wrong. This practice streamlines collaboration, simplifies troubleshooting, and ultimately helps maintain the integrity and reliability of your codebase.
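As a quick sanity check before installing anything, the snippet below reports which interpreter is running and whether it appears to live in an isolated environment. It is a minimal sketch: the helper name `in_virtual_environment` is made up for illustration, the `sys.prefix` comparison only detects `venv`-style environments, and the `CONDA_DEFAULT_ENV` lookup assumes the environment was activated with `conda activate`.

```python
import os
import sys


def in_virtual_environment() -> bool:
    """Return True when the interpreter runs inside a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment directory while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


# conda sets this variable on activation; it is None when no conda env is active.
conda_env = os.environ.get("CONDA_DEFAULT_ENV")

print(f"Python executable: {sys.executable}")
print(f"Python version: {sys.version.split()[0]}")
print(f"Inside a venv: {in_virtual_environment()}")
print(f"Active conda environment: {conda_env}")
```

If both checks come back negative, packages installed from the notebook will most likely end up in the system-wide Python installation.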

- Core libraries:
    - `pandas` for data manipulation and exploration
        - [Docs](https://pandas.pydata.org/docs/)
    - `scikit-learn` for preprocessing and modeling
        - [Docs](https://scikit-learn.org/stable/)
    - `matplotlib` for basic plotting and data visualization
        - [Docs](https://matplotlib.org/stable/contents.html)
    - `PyTorch` for deep learning workflows (optional)
        - PyTorch is not a core library, but it is widely used for deep learning. We only touch on it briefly in this course; explore it further if you are interested in deep learning.
        - [Docs](https://pytorch.org/docs/stable/index.html)

In [2]:
# Importing the libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import torch

In [6]:
# Print out the versions of the libraries
import sklearn  # imported here so the scikit-learn version can be reported

print(f'Pandas version: {pd.__version__}')
print(f'Numpy version: {np.__version__}')
print(f'Matplotlib version: {plt.matplotlib.__version__}')
print(f'Scikit-learn version: {sklearn.__version__}')
print(f'PyTorch version: {torch.__version__}')
print(f'Access to GPU: {torch.cuda.is_available()}')

Pandas version: 2.2.2
Numpy version: 1.26.4
Matplotlib version: 3.9.2
PyTorch version: 2.5.1
Access to GPU: True


## 3. Reading and Writing Data with Pandas

In this section, we will cover the fundamental steps for working with data files in Pandas, including how to:

- **Read data** from common file formats such as CSV.
- **Write processed data** back to disk in CSV format.
- Use **basic inspection methods** (such as `head()`, `info()`, and `describe()`) to quickly understand the structure and statistical properties of your dataset.

In [None]:
# Reading data
df = pd.read_csv('../data/house-prices/test.csv')  

# Displaying the first 5 rows
df.head()

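| | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | ScreenPorch | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1461 | 20 | RH | 80.0 | 11622 | Pave | NaN | Reg | Lvl | AllPub | ... | 120 | 0 | NaN | MnPrv | NaN | 0 | 6 | 2010 | WD | Normal |
| 1 | 1462 | 20 | RL | 81.0 | 14267 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | 0 | NaN | NaN | Gar2 | 12500 | 6 | 2010 | WD | Normal |
| 2 | 1463 | 60 | RL | 74.0 | 13830 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | 0 | NaN | MnPrv | NaN | 0 | 3 | 2010 | WD | Normal |
| 3 | 1464 | 60 | RL | 78.0 | 9978 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | 0 | NaN | NaN | NaN | 0 | 6 | 2010 | WD | Normal |
| 4 | 1465 | 120 | RL | 43.0 | 5005 | Pave | NaN | IR1 | HLS | AllPub | ... | 144 | 0 | NaN | NaN | NaN | 0 | 1 | 2010 | WD | Normal |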


In [4]:
# .info() method to get a summary of the dataframe 

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1459 entries, 0 to 1458
Data columns (total 80 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1459 non-null   int64  
 1   MSSubClass     1459 non-null   int64  
 2   MSZoning       1455 non-null   object 
 3   LotFrontage    1232 non-null   float64
 4   LotArea        1459 non-null   int64  
 5   Street         1459 non-null   object 
 6   Alley          107 non-null    object 
 7   LotShape       1459 non-null   object 
 8   LandContour    1459 non-null   object 
 9   Utilities      1457 non-null   object 
 10  LotConfig      1459 non-null   object 
 11  LandSlope      1459 non-null   object 
 12  Neighborhood   1459 non-null   object 
 13  Condition1     1459 non-null   object 
 14  Condition2     1459 non-null   object 
 15  BldgType       1459 non-null   object 
 16  HouseStyle     1459 non-null   object 
 17  OverallQual    1459 non-null   int64  
 18  OverallC

In [5]:
# .describe() method to get a statistical summary of the dataframe

df.describe()

| | Id | MSSubClass | LotFrontage | LotArea | OverallQual | OverallCond | YearBuilt | YearRemodAdd | MasVnrArea | BsmtFinSF1 | ... | GarageArea | WoodDeckSF | OpenPorchSF | EnclosedPorch | 3SsnPorch | ScreenPorch | PoolArea | MiscVal | MoSold | YrSold |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1459.0 | 1459.0 | 1232.0 | 1459.0 | 1459.0 | 1459.0 | 1459.0 | 1459.0 | 1444.0 | 1458.0 | ... | 1458.0 | 1459.0 | 1459.0 | 1459.0 | 1459.0 | 1459.0 | 1459.0 | 1459.0 | 1459.0 | 1459.0 |
| mean | 2190.0 | 57.378341 | 68.580357 | 9819.161069 | 6.078821 | 5.553804 | 1971.357779 | 1983.662783 | 100.709141 | 439.203704 | ... | 472.768861 | 93.174777 | 48.313914 | 24.243317 | 1.79438 | 17.064428 | 1.744345 | 58.167923 | 6.104181 | 2007.769705 |
| std | 421.321334 | 42.74688 | 22.376841 | 4955.517327 | 1.436812 | 1.11374 | 30.390071 | 21.130467 | 177.6259 | 455.268042 | ... | 217.048611 | 127.744882 | 68.883364 | 67.227765 | 20.207842 | 56.609763 | 30.491646 | 630.806978 | 2.722432 | 1.30174 |
| min | 1461.0 | 20.0 | 21.0 | 1470.0 | 1.0 | 1.0 | 1879.0 | 1950.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 2006.0 |
| 25% | 1825.5 | 20.0 | 58.0 | 7391.0 | 5.0 | 5.0 | 1953.0 | 1963.0 | 0.0 | 0.0 | ... | 318.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4.0 | 2007.0 |
| 50% | 2190.0 | 50.0 | 67.0 | 9399.0 | 6.0 | 5.0 | 1973.0 | 1992.0 | 0.0 | 350.5 | ... | 480.0 | 0.0 | 28.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 6.0 | 2008.0 |
| 75% | 2554.5 | 70.0 | 80.0 | 11517.5 | 7.0 | 6.0 | 2001.0 | 2004.0 | 164.0 | 753.5 | ... | 576.0 | 168.0 | 72.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 8.0 | 2009.0 |
| max | 2919.0 | 190.0 | 200.0 | 56600.0 | 10.0 | 9.0 | 2010.0 | 2010.0 | 1290.0 | 4010.0 | ... | 1488.0 | 1424.0 | 742.0 | 1012.0 | 360.0 | 576.0 | 800.0 | 17000.0 | 12.0 | 2010.0 |


In [6]:
# Writing data to a csv file

df.to_csv('../data/tmp/processed_housing.csv', index=False)
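Note that `to_csv` does not create missing folders, so writing into a directory that does not exist yet raises an error. The sketch below shows one way to guard against that and to confirm that the file reads back as expected. It uses a small made-up frame and a temporary directory rather than the course data, and the names `toy`, `out_path`, and `reloaded` are illustrative only.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Small illustrative frame (made-up values, not the course dataset).
toy = pd.DataFrame(
    {
        "Id": [1, 2, 3],
        "LotArea": [8450, 9600, 11250],
        "Street": ["Pave", "Pave", "Grvl"],
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "processed" / "toy.csv"
    # Create the target folder first; to_csv will not do this for us.
    out_path.parent.mkdir(parents=True, exist_ok=True)
    # index=False keeps the row index out of the file, so no extra
    # "Unnamed: 0" column appears when the file is read back.
    toy.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

print(reloaded.shape)
print(list(reloaded.columns))
```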

## 4. Exploring and Visualizing Data

Content:

- Basic summary statistics.
- Identifying distributions of features.
- Simple visualizations (histograms, box plots, scatter plots) to understand data distribution, outliers, and relationships between variables.
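The three plot types listed above could be sketched as follows. The data here is synthetic: the column names borrow from the housing dataset, but the values and the `Price` column are generated purely for illustration and do not come from the course files.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic data for illustration only (not the course dataset).
rng = np.random.default_rng(seed=0)
toy = pd.DataFrame(
    {
        "LotArea": rng.normal(10_000, 2_500, size=200).round(),
        "OverallQual": rng.integers(1, 11, size=200),
    }
)
# Hypothetical target: a noisy linear function of the two features.
toy["Price"] = (
    50_000
    + 12 * toy["LotArea"]
    + 15_000 * toy["OverallQual"]
    + rng.normal(0, 20_000, size=200)
)

# Basic summary statistics for every numeric column.
summary = toy.describe()
print(summary)

fig, axes = plt.subplots(1, 3, figsize=(12, 3.5))

# Histogram: the distribution of a single feature.
axes[0].hist(toy["LotArea"], bins=20)
axes[0].set_title("Histogram of LotArea")

# Box plot: median, quartiles, and potential outliers.
axes[1].boxplot(toy["Price"])
axes[1].set_title("Box plot of Price")

# Scatter plot: the relationship between two variables.
axes[2].scatter(toy["LotArea"], toy["Price"], s=10)
axes[2].set_title("LotArea vs. Price")

fig.tight_layout()
```

In a notebook the figure is displayed automatically at the end of the cell; in a plain script, add `plt.show()`.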

## 5. Handling Missing Values and Outliers

Content:

- Techniques for detecting missing values (`isnull().sum()`) and outliers (using IQR or z-score).
- Strategies for handling missing data (drop vs. impute).
- Using `sklearn.impute.SimpleImputer` for numerical and categorical data.
- Discussion of domain knowledge in deciding how to handle anomalies.
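A minimal sketch of the detection and imputation steps listed above, run on a tiny made-up frame. The values are invented, and the 1.5 × IQR threshold and the median / most-frequent strategies are common defaults chosen here as assumptions, not the only reasonable choices.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Tiny made-up frame with missing values and one extreme value.
toy = pd.DataFrame(
    {
        "LotFrontage": [80.0, np.nan, 74.0, 78.0, 70.0, 500.0],
        "Alley": ["Grvl", np.nan, "Pave", np.nan, "Grvl", "Grvl"],
    }
)

# Count missing values per column.
missing_counts = toy.isnull().sum()
print(missing_counts)

# Flag outliers with the IQR rule: anything beyond 1.5 * IQR from the quartiles.
q1 = toy["LotFrontage"].quantile(0.25)
q3 = toy["LotFrontage"].quantile(0.75)
iqr = q3 - q1
is_outlier = (toy["LotFrontage"] < q1 - 1.5 * iqr) | (
    toy["LotFrontage"] > q3 + 1.5 * iqr
)
print(toy[is_outlier])

# Impute: median for the numeric column, most frequent value for the categorical one.
num_imputer = SimpleImputer(strategy="median")
cat_imputer = SimpleImputer(strategy="most_frequent")

imputed = toy.copy()
imputed["LotFrontage"] = num_imputer.fit_transform(toy[["LotFrontage"]]).ravel()
imputed["Alley"] = cat_imputer.fit_transform(toy[["Alley"]]).ravel()
print(imputed)
```

Whether a flagged value is an error or a genuine extreme, and whether a missing category is itself informative, is where domain knowledge comes in; the code only surfaces the candidates.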

## 6. Encoding Categorical Variables

Content:

- Importance of converting string labels into numeric form for modeling.
- One-hot encoding with `pd.get_dummies()` or `sklearn.preprocessing.OneHotEncoder`.
- Label encoding vs. one-hot encoding and when to use each.

## 7. Feature Scaling and Normalization

Content:
- Explain why features on different scales can negatively affect certain models.
- Show `StandardScaler` and `MinMaxScaler` from scikit-learn.
- Discuss when scaling is necessary (e.g., for neural networks or distance-based models).
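To make the two scalers concrete, here is a small sketch on a made-up array whose columns are on very different scales (the numbers are invented for illustration).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Two features on very different scales (made-up values):
# column 0 is in the thousands, column 1 is a single-digit rating.
X = np.array(
    [
        [8450.0, 5.0],
        [9600.0, 6.0],
        [11250.0, 7.0],
        [14260.0, 8.0],
    ]
)

# StandardScaler: subtract the column mean and divide by the column standard deviation.
X_standard = StandardScaler().fit_transform(X)

# MinMaxScaler: rescale each column linearly to the range [0, 1].
X_minmax = MinMaxScaler().fit_transform(X)

print(X_standard.mean(axis=0).round(6), X_standard.std(axis=0).round(6))
print(X_minmax.min(axis=0), X_minmax.max(axis=0))
```

In a real workflow the scaler should be fitted on the training data only and then applied to the validation and test data with `transform`, so that no information from held-out data leaks into preprocessing.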

## 8. Train-Test Split and Basic Data Pipelines

Content:
- Introduce the concept of splitting data into training, validation, and test sets.
- Show `train_test_split` usage from scikit-learn.
- Introduce basic pipeline concepts (`sklearn.pipeline.Pipeline`) to ensure consistent preprocessing and modeling steps.
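The split and pipeline ideas above could be sketched as follows. The data is synthetic, the 60/20/20 split is just one common choice, and `LinearRegression` stands in as a placeholder model picked for this illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (illustration only).
rng = np.random.default_rng(seed=0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

# First hold out 20% of the data as the test set ...
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# ... then split the remainder again to get a validation set (60/20/20 overall).
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=42
)

# The pipeline fits the scaler on the training data only and applies the
# same fitted transformation whenever it predicts or scores.
model = Pipeline(
    [
        ("scaler", StandardScaler()),
        ("regressor", LinearRegression()),
    ]
)
model.fit(X_train, y_train)

val_score = model.score(X_val, y_val)  # R^2 on the validation set
print(len(X_train), len(X_val), len(X_test))
print(f"Validation R^2: {val_score:.3f}")
```

The test set is deliberately left untouched here; it would be used once, after model selection on the validation set is finished.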

## 9. Introduction to PyTorch Tensors

Content:
- Briefly show how PyTorch tensors differ from NumPy arrays and how to convert between them.
- This will be relevant for deep learning sessions later in the course.
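A brief sketch of the NumPy and PyTorch conversions mentioned above, assuming PyTorch is installed. It highlights two practical points: `torch.from_numpy` shares memory with the source array while `torch.tensor` makes a copy, and tensors (unlike arrays) can track gradients.

```python
import numpy as np
import torch

array = np.array([[1.0, 2.0], [3.0, 4.0]])

# from_numpy shares memory with the NumPy array; torch.tensor copies the data.
tensor_shared = torch.from_numpy(array)
tensor_copy = torch.tensor(array)

# Modifying the array in place is visible through the shared tensor only.
array[0, 0] = 10.0
print(tensor_shared[0, 0].item(), tensor_copy[0, 0].item())

# Converting back: .numpy() works for CPU tensors and again shares memory.
back_to_numpy = tensor_shared.numpy()
print(type(back_to_numpy), tensor_shared.dtype)

# Tensors can record operations for automatic differentiation.
x = torch.tensor([2.0], requires_grad=True)
y = (x ** 2).sum()
y.backward()  # computes dy/dx = 2 * x
print(x.grad)
```

Tensors can also be moved to a GPU with `.to("cuda")` when `torch.cuda.is_available()` returns `True`.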

## 10. Using Generative AI in this Course 

Content:
- Briefly introduce the concept of generative AI and how it can be used in this course.
- Mention that we will cover this in more detail in later sessions.
