# Loading data with scikit-learn

In this notebook, we will review how to load data before any machine learning takes place.

<a href="https://colab.research.google.com/github/thomasjpfan/ml-workshop-intro-v2/blob/main/notebooks/01-loading-data.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open and Execute in Google Colaboratory"></a>

In [None]:
# Install dependencies when running on Google Colab
import sys
IN_COLAB = 'google.colab' in sys.modules
if IN_COLAB:
    %pip install -r https://raw.githubusercontent.com/thomasjpfan/ml-workshop-intro-v2/main/requirements.txt

In [None]:
import sklearn
assert sklearn.__version__.startswith("1.2"), "Please install scikit-learn 1.2"

## Generated datasets

### Regression

In [None]:
from sklearn.datasets import make_regression

X, y = make_regression()

In [None]:
X

In [None]:
y
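Called without arguments, `make_regression` returns 100 samples with 100 features each. The sketch below passes the size, noise level, and seed explicitly; the specific values and the variable names `X_small`, `y_small`, `X_default`, and `y_default` are chosen only for illustration.

```python
from sklearn.datasets import make_regression

# Illustrative values: 200 samples, 5 features (3 of them informative),
# Gaussian noise on the target, and a fixed seed for reproducibility.
X_small, y_small = make_regression(
    n_samples=200, n_features=5, n_informative=3, noise=10.0, random_state=0
)
print(X_small.shape)  # (200, 5)
print(y_small.shape)  # (200,)

# With the defaults, the feature matrix has 100 rows and 100 columns.
X_default, y_default = make_regression(random_state=0)
print(X_default.shape)  # (100, 100)
```

Fixing `random_state` makes the generated arrays identical across runs, which helps when comparing models on the same synthetic data.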

### Classification

In [None]:
from sklearn.datasets import make_classification

X, y = make_classification()

In [None]:
X

In [None]:
y
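By default, `make_classification` generates 100 samples with 20 features and two classes. As a rough sketch, the number of classes and features can be set explicitly and the resulting labels inspected with NumPy; the parameter values and the names `X_clf` and `y_clf` here are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification

# Illustrative values: 3 classes described by 4 features, 2 of them informative.
X_clf, y_clf = make_classification(
    n_samples=300,
    n_features=4,
    n_informative=2,
    n_redundant=0,
    n_classes=3,
    n_clusters_per_class=1,
    random_state=0,
)
print(X_clf.shape)  # (300, 4)

# The target holds integer class labels; count how many samples fall in each class.
classes, counts = np.unique(y_clf, return_counts=True)
print(classes)
print(counts)
```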

## Sample datasets

In [None]:
from sklearn.datasets import fetch_openml

In [None]:
# Downloads the dataset from OpenML on the first call (network access required)
penguins = fetch_openml(data_id=42585, as_frame=True, parser="pandas")

In [None]:
print(penguins.DESCR)

In [None]:
X = penguins.data
y = penguins.target

In [None]:
X

In [None]:
import matplotlib.pyplot as plt
# Color each point by species: `cat.codes` maps the categorical target to integers
plt.scatter(X['culmen_length_mm'], X['culmen_depth_mm'], c=y.cat.codes)

In [None]:
penguins_df = penguins.frame

In [None]:
penguins_df
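The attributes used above (`DESCR`, `data`, `target`, `frame`) belong to the container object that scikit-learn's dataset loaders return. The sketch below inspects the same attributes on the bundled iris dataset, used here only as a stand-in because it loads without a network connection; it is not part of the original notebook.

```python
from sklearn.datasets import load_iris

# Stand-in dataset: ships with scikit-learn, so no download is needed.
iris = load_iris(as_frame=True)

# With as_frame=True, `data` is a DataFrame of features and `target` is a Series.
print(iris.data.shape)    # (150, 4)
print(iris.target.name)   # target

# `frame` combines the feature columns and the target column in one DataFrame.
print(iris.frame.shape)   # (150, 5)
print(list(iris.frame.columns))
```

Working from `frame` is convenient for plotting libraries that expect a single table, while `data` and `target` are the usual inputs for fitting an estimator.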

In [None]:
import seaborn as sns
sns.set_theme(font_scale=1.3)

In [None]:
sns.relplot(data=penguins_df, x='culmen_length_mm', y='culmen_depth_mm',
            hue='species', height=6);

In [None]:
sns.displot(data=penguins_df, x="culmen_length_mm", hue="species", kind="kde", aspect=2);

In [None]:
sns.jointplot(data=penguins_df, x="culmen_length_mm", y="culmen_depth_mm", hue="species", height=8);

## Exercise 1

1. Load the wine dataset from the `sklearn.datasets` module using the `load_wine` function and with `as_frame=True`.
2. Print the description of the dataset.
3. What is the number of samples and features in this dataset?
4. Is this a classification or a regression problem? Hint: The target column is called `target`.
5. Use `sns.jointplot` to explore the relationship between the `proline` and `flavanoids` features. (Be sure to set `hue` to the target name)

In [None]:
from sklearn.datasets import load_wine

**If you are running locally**, you can uncomment the following cell to load the solution into the cell. On **Google Colab**, [see solution here](https://github.com/thomasjpfan/ml-workshop-intro-v2/blob/main/notebooks/solutions/01-ex1-solution.py). 

In [None]:
# %load solutions/01-ex1-solution.py