<a href="https://colab.research.google.com/github/stefanlessmann/ASE-ML/blob/main/Day-1-Practice.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Welcome to our Section 10 Deep Learning Practice Sessions
Today, we introduce Google Colab, the tool we will use in Sections 10-14 to illustrate selected concepts and for practical exercises.

In a nutshell, Colab is a browser-based tool for developing Python notebooks. Besides letting you write Python code and interact with the Python ecosystem, Colab also allows you to execute your code on the Google Cloud.

The big advantage for our course is that you do not need to install any software on your computer. This makes Colab a simple, user-friendly way to get started with Python and machine learning. If you'd like to know more, please have a look at the [comprehensive introduction by Google Research](https://colab.research.google.com/).

The outline of today's session is as follows:
- Brief overview of key functions in Colab
- Working with Python libraries
- Generating synthetic data for supervised learning using the `sklearn` library
- Loading demo data from GitHub
- Eyeballing data using the `Pandas` library

## Working with Python libraries
The beauty of Python for machine learning stems from the fact that several powerful, easy-to-use libraries are available to perform all sorts of machine learning tasks. Prior to using the methods in some library, you need to *import* the corresponding library. The following example illustrates how to do this by importing some common libraries for machine learning.
<br><br>
- Run the code in the following cell. 
- Next, explore how you can use the methods provided in the libraries and how Colab supports you in that quest. Create a new code cell and type `np.`. After a little while, Colab should display a tooltip listing the available methods. Their names might appear somewhat cryptic at first glance.
- Select the method `arange()` from the list. Your code cell should now contain the statement `np.arange()`.
- Move the mouse over this statement to explore the description of the method, also called *docstring* in Python. Check whether you find the docstring informative. What functionality does the `arange` method provide?


In [1]:
# Standard libraries for data handling and plotting
import numpy as np 
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
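
As a small illustration of the `arange` exploration described above, the following sketch calls `np.arange()` with a few assumed arguments and prints the first line of its docstring. The variable names and the chosen start, stop, and step values are only examples.

```python
import numpy as np

# arange(stop) counts from 0 up to, but not including, stop
first_five = np.arange(5)
print(first_five)  # [0 1 2 3 4]

# arange(start, stop, step) lets you choose the start value and the step size
evens = np.arange(0, 10, 2)
print(evens)  # [0 2 4 6 8]

# The docstring that Colab shows on hover is also available programmatically
docstring_lines = np.arange.__doc__.strip().splitlines()
print(docstring_lines[0])
```

In a notebook you can also type `np.arange?` in a cell, or call `help(np.arange)`, to read the full docstring.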


## Generating synthetic data for supervised learning using the sklearn library
Supervised learning requires labeled data. Typically, such data stems from corporate data warehouses and was gathered during day-to-day business operations. For illustrative purposes, Python makes it easy to generate synthetic data for supervised learning.

- Run the following code cell to import the method `make_regression()`, which is part of the `sklearn` library.
- Explore the docstring of the method and try using it to create a synthetic data set for supervised learning, more specifically, regression.
- Use the method to create a *univariate regression* problem. The resulting synthetic data should comprise a dependent variable and one independent variable.
- Create a 2D chart of your dependent against your independent variable.
- Optional: play a bit with the arguments of `make_regression()` to alter your data. Re-create the plot and inspect how it has changed.
- Run a web search for "python fit regression model". This search will provide many suggestions of how you can write some Python code to fit a linear regression model to your data. Try it out by fitting a regression model to your synthetic data. 
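
One possible way to approach the data generation and plotting steps above is sketched below. It is not the only solution: the sample size, the noise level, and the `random_state` seed are assumed values chosen for illustration.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression

# n_features=1 yields a univariate problem: one independent variable X and a target y.
# n_samples, noise, and random_state are illustrative choices, not prescribed values.
X, y = make_regression(n_samples=100, n_features=1, noise=10.0, random_state=42)
print(X.shape, y.shape)  # (100, 1) (100,)

# Scatter plot of the dependent variable against the independent variable
fig, ax = plt.subplots()
ax.scatter(X[:, 0], y)
ax.set_xlabel("independent variable")
ax.set_ylabel("dependent variable")
ax.set_title("Synthetic univariate regression data")
plt.show()
```

Increasing `noise` spreads the points further around the underlying line, which is an easy way to see how the arguments change the data.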




In [None]:
# This code imports the method make_regression that we need for this exercise
from sklearn.datasets import make_regression
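
For the last step of the exercise, a minimal sketch of fitting a linear regression with `sklearn`'s `LinearRegression` could look as follows. The data settings are assumptions for illustration; `coef=True` asks `make_regression()` to also return the slope it used, so that the estimate can be compared with it.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

# Re-create synthetic data; coef=True additionally returns the true slope.
# Sample size, noise level, and seed are illustrative assumptions.
X, y, true_coef = make_regression(
    n_samples=100, n_features=1, noise=10.0, coef=True, random_state=42
)

# Fit an ordinary least squares regression of y on X
model = LinearRegression()
model.fit(X, y)

print("true slope:     ", float(true_coef))
print("estimated slope:", model.coef_[0])
print("intercept:      ", model.intercept_)
print("R^2 on the data:", model.score(X, y))
```

The estimated slope should be close to the true slope but not identical, because the noise term perturbs the target values.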


## Loading demo data from GitHub
The data set we use for this session is available at the following URL: <br>
https://raw.githubusercontent.com/stefanlessmann/ASE-ML/master/hmeq.csv

Create a variable named `data_url` in which you store this URL. Next, use the method `read_csv()` from the `Pandas` library to download the data right from the web and store it in a `DataFrame`. 

In [2]:
# Enter code to download the demo data set from the web 
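
A minimal sketch of the download step is shown below. The URL is the one given above; the helper name `load_hmeq` is hypothetical and only serves to defer the download until you call it.

```python
import pandas as pd

# URL of the demo data set as stated in this notebook
data_url = "https://raw.githubusercontent.com/stefanlessmann/ASE-ML/master/hmeq.csv"


def load_hmeq(source=data_url):
    """Read the CSV into a DataFrame (hypothetical helper).

    pd.read_csv accepts a URL, a local file path, or a file-like object,
    so the same call works for the web address and for a local copy.
    """
    return pd.read_csv(source)
```

In a notebook cell you would then run `df = load_hmeq()` followed by `df.head()` to check that the download worked. Calling `pd.read_csv(data_url)` directly is equivalent.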


### Introducing the HMEQ data set (you can skip this section)
Our data set, called the "Home Equity" or, in brief, HMEQ data set, is provided by www.creditriskanalytics.net. It comprises information about a set of borrowers, who are described by demographic variables and by variables concerning their business relationship with the lender. A binary target variable called 'BAD' is provided and indicates whether a borrower defaulted (1) or repaid her/his debt (0). You can think of the data as a standard use case of binary classification.

You can obtain the data, together with other interesting finance data sets, directly from www.creditriskanalytics.net. The website also provides a brief description of the data set. Specifically, the data set consists of 5,960 observations and 13 variables including the target variable. The variables are defined as follows:

- BAD: the target variable, 1=default; 0=non-default 
- LOAN: amount of the loan request
- MORTDUE: amount due on an existing mortgage
- VALUE: value of current property
- REASON: DebtCon=debt consolidation; HomeImp=home improvement
- JOB: occupational categories
- YOJ: years at present job
- DEROG: number of major derogatory reports
- DELINQ: number of delinquent credit lines
- CLAGE: age of oldest credit line in months
- NINQ: number of recent credit inquiries
- CLNO: number of credit lines
- DEBTINC: debt-to-income ratio

As you can see, the features aim at describing the financial situation of a borrower. We will keep using the data set for many modeling tasks in this and future demo notebooks, so it makes sense to familiarize yourself with the above features. Make sure you understand what type of information they provide and what this information might reveal about the risk of defaulting.

## Eyeballing data using the Pandas library
In this part, we will demonstrate some commonly used methods to explore our data and form some hypotheses about the relationship between independent variables and the target variable. 

In [None]:
# Exploring the HMEQ data set
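
The sketch below shows a few `Pandas` methods that are commonly used for a first look at a data set. To stay self-contained it runs on a tiny made-up `DataFrame` named `toy`, which only borrows some HMEQ column names; its values are invented for illustration and are not real HMEQ records. The same calls work on the full data set once you have loaded it.

```python
import numpy as np
import pandas as pd

# Made-up rows that mimic a few HMEQ columns; NOT actual HMEQ data
toy = pd.DataFrame({
    "BAD": [1, 0, 0, 1, 0, 0],
    "LOAN": [1100, 1300, 1500, 1700, 2000, 2300],
    "REASON": ["HomeImp", "HomeImp", "DebtCon", None, "DebtCon", "HomeImp"],
    "DEBTINC": [np.nan, 35.2, 28.9, 41.7, np.nan, 30.1],
})

# First rows, column types, and summary statistics
print(toy.head())
toy.info()
print(toy.describe())

# Number of missing values per column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Share of defaulters (BAD=1) and non-defaulters (BAD=0)
target_share = toy["BAD"].value_counts(normalize=True)
print(target_share)

# Average loan amount per target class, a first hint at a possible relationship
mean_loan_by_target = toy.groupby("BAD")["LOAN"].mean()
print(mean_loan_by_target)
```

Comparing group-wise statistics such as `mean_loan_by_target` across the two classes is one simple way to form hypotheses about which features relate to default.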
