-   Tyler Arista, tja9

# Instructions for today's practice

- Create a copy of this Jupyter Notebook and share it with your partner.
- Fill in the student names and e-mails in the text cell above.
- At the end of the practice, download the .ipynb file and upload it to Moodle.

![](https://images.unsplash.com/photo-1629493502193-0168c9f65376?q=80&w=2071&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D)

# Dataset: species and habitats

Our data simulates an environmental research study in which different **species** were observed at several research sites. The dataset contains both numerical and categorical features, and the goal is to classify the **species type** based on these attributes.

### **Dataset Columns:**
1. **Observation ID**: Unique identifier for each observation.
2. **Site ID**: Identifier for the research site (e.g., `Site-001` to `Site-050`).
3. **Elevation (m)**: Elevation of the site in meters (ranging from 500 to 4000).
4. **Average Temperature (°C)**: Mean temperature at the site (from -10°C to 30°C).
5. **Soil Type**: Type of soil at the site (`Sandy`, `Clay`, `Loamy`, `Peaty`, `Chalky`).
6. **Habitat Type**: Habitat type of the site (`Forest`, `Grassland`, `Wetland`, `Desert`).
7. **Precipitation (mm/year)**: Yearly precipitation in millimeters (200 to 3000).
8. **Species Type**: The target variable representing the type of species observed (`Plant`, `Insect`, `Bird`, `Mammal`, `Reptile`).

**📝 Exercise 1**: Load the generated natural science dataset and perform basic exploratory data analysis.
   - Import necessary libraries.
   - Load the dataset into a dataframe; the URL is [https://cs.calvin.edu/courses/data/202/fsantos/datasets/species_habitat.csv](https://cs.calvin.edu/courses/data/202/fsantos/datasets/species_habitat.csv).
   - Display the first few rows and the dataframe summary (`.info()`).
   - Check the data type of each column (categorical/numerical, ordered/unordered).

In [4]:
import pandas as pd

df = pd.read_csv('https://cs.calvin.edu/courses/data/202/fsantos/datasets/species_habitat.csv')

df.head()

|   | Elevation (m) | Average Temperature (°C) | Precipitation (mm/year) | Soil Type | Habitat Type | Species Type |
|---|---|---|---|---|---|---|
| 0 | 475.357153 | 8.299849 | 3795.975453 | Peaty | Forest | Plant |
| 1 | 1256.272452 | 6.674172 | 657.18059 | Sandy | Grassland | Mammal |
| 2 | 969.909852 | 32.48664 | 106.169555 | Chalky | Wetland | Reptile |
| 3 | 1912.726729 | 15.495129 | 1579.862547 | Loamy | Forest | Bird |
| 4 | 1119.680235 | 0.933313 | 3897.646523 | Sandy | Desert | Mammal |


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 6 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Elevation (m)             1000 non-null   float64
 1   Average Temperature (°C)  1000 non-null   float64
 2   Precipitation (mm/year)   1000 non-null   float64
 3   Soil Type                 1000 non-null   object 
 4   Habitat Type              1000 non-null   object 
 5   Species Type              1000 non-null   object 
dtypes: float64(3), object(3)
memory usage: 47.0+ KB


In [6]:
df.dtypes

| Column | dtype |
|---|---|
| Elevation (m) | float64 |
| Average Temperature (°C) | float64 |
| Precipitation (mm/year) | float64 |
| Soil Type | object |
| Habitat Type | object |
| Species Type | object |
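
The dtypes only separate text columns from numeric ones; whether a categorical column is ordered or unordered has to be judged from its levels. The sketch below lists the levels of each non-numeric column. It uses a small stand-in frame (`sample`, a hypothetical name) built from the first four rows of the preview, so it runs without downloading the dataset.

```python
import pandas as pd

# Stand-in frame: first four rows of the preview, rounded (not the full dataset).
sample = pd.DataFrame({
    "Elevation (m)": [475.4, 1256.3, 969.9, 1912.7],
    "Soil Type": ["Peaty", "Sandy", "Chalky", "Loamy"],
    "Habitat Type": ["Forest", "Grassland", "Wetland", "Forest"],
})

# Every non-numeric column is treated as a categorical candidate.
categorical = sample.select_dtypes(exclude="number").columns

# Collect the distinct levels of each categorical column.
category_levels = {col: sorted(sample[col].unique()) for col in categorical}

for col, levels in category_levels.items():
    print(f"{col}: {levels}")
```

Levels such as soil or habitat names have no natural ranking, which suggests treating them as unordered (nominal) categories.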


**📝 Exercise 2**: Separate the **features and the target variable** for classification.
   - Split the dataset into `X` (features) and `y` (target).
   - Print the shapes of `X` and `y`.
   - Identify which columns are **categorical** and which are **numerical**.

In [7]:
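# Features: every column except the target
x = df.drop(columns='Species Type')
# Target: the species label to predict
y = df['Species Type']

print("Shape of X: ", x.shape)
print("Shape of y: ", y.shape)

# Text (object) columns are categorical; numeric columns are numerical
categorical_cols = x.select_dtypes(include=['object']).columns
numerical_cols = x.select_dtypes(include='number').columns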

print("Categorical columns: ", categorical_cols)
print("Numerical columns: ", numerical_cols)

Shape of X:  (1000, 5)
Shape of y:  (1000,)
Categorical columns:  Index(['Soil Type', 'Habitat Type'], dtype='object')
Numerical columns:  Index(['Elevation (m)', 'Average Temperature (°C)', 'Precipitation (mm/year)'], dtype='object')


**📝 Exercise 3**: **Split training and testing sets** for model development.
   - Use `train_test_split` from scikit-learn to split the dataset into **80% training** and **20% testing**.
   - Set a random seed for reproducibility.
   - Print the number of observations in each set.

In [8]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

print("Training set size: ", X_train.shape[0])
print("Testing set size: ", X_test.shape[0])

Training set size:  800
Testing set size:  200
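
As an optional variation that the exercise does not ask for, `train_test_split` also accepts a `stratify` argument that keeps the class proportions the same in both sets. The sketch below shows this on made-up, deliberately imbalanced labels; `features` and `labels` are hypothetical names, not part of the notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up imbalanced data: 80 "Plant" rows and 20 "Bird" rows.
labels = pd.Series(["Plant"] * 80 + ["Bird"] * 20, name="Species Type")
features = pd.DataFrame({"Elevation (m)": range(100)})

# stratify=labels preserves the 80/20 class ratio in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    features, labels, test_size=0.2, random_state=42, stratify=labels
)

print(y_tr.value_counts())
print(y_te.value_counts())
```

Without `stratify`, a purely random split can leave a rare class under-represented in the test set.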


**📝 Exercise 4**: Build **pipelines to handle numerical and categorical features** separately.
   - For **Numerical Features**: Use `StandardScaler` to standardize the features.
   - For **Categorical Features**: Use `OneHotEncoder` with `drop='first'` to encode the categorical columns.
   - Combine these two pipelines using `ColumnTransformer` from `sklearn.compose`.

In [9]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

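# Standardize numerical features to zero mean and unit variance
numerical_pipeline = StandardScaler()
# One-hot encode categorical features, dropping the first level of each
categorical_pipeline = OneHotEncoder(drop='first')

# Apply each transformer to its own group of columns
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_pipeline, numerical_cols),
        ('cat', categorical_pipeline, categorical_cols)
    ])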

**📝 Exercise 5**: Build a **complete pipeline** that includes both the preprocessing and the k-NN classifier.
   - Import `KNeighborsClassifier` from `sklearn.neighbors`.
   - Create a complete pipeline that includes the `ColumnTransformer` and a `KNeighborsClassifier` with `n_neighbors=5`.
   - Fit the model using the training set (`X_train` and `y_train`).

In [10]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

knn_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', KNeighborsClassifier(n_neighbors=5))
])

knn_pipeline.fit(X_train, y_train)


**📝 Exercise 6**: Use the trained model to **make predictions** on the test data.
   - Use the pipeline to make predictions on `X_test`.
   - Store the predictions in a variable called `y_pred`.
   - Print the first 10 predictions alongside the actual labels to compare.

In [12]:
y_pred = knn_pipeline.predict(X_test)

print("Predictions:", y_pred[:10])
print("Actual labels:", y_test[:10].values)

Predictions: ['Bird' 'Plant' 'Mammal' 'Bird' 'Bird' 'Mammal' 'Insect' 'Mammal' 'Plant'
 'Bird']
Actual labels: ['Insect' 'Mammal' 'Bird' 'Bird' 'Bird' 'Mammal' 'Insect' 'Mammal' 'Plant'
 'Bird']
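
One way to compare the two arrays more easily is to place them side by side in a dataframe. The sketch below copies the ten predictions and ten actual labels from the printed output, so it runs without the fitted pipeline; `comparison` is a hypothetical name.

```python
import pandas as pd

# Values copied from the printed output of the cell above.
predicted = ["Bird", "Plant", "Mammal", "Bird", "Bird",
             "Mammal", "Insect", "Mammal", "Plant", "Bird"]
actual = ["Insect", "Mammal", "Bird", "Bird", "Bird",
          "Mammal", "Insect", "Mammal", "Plant", "Bird"]

comparison = pd.DataFrame({"Predicted": predicted, "Actual": actual})
# True where the prediction agrees with the actual label.
comparison["Match"] = comparison["Predicted"] == comparison["Actual"]

print(comparison)
print("Matches:", int(comparison["Match"].sum()), "of", len(comparison))
```

In the notebook itself, the same frame could be built from `y_pred[:10]` and `y_test[:10].values`.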


**📝 Exercise 7**: Measure the performance of the k-NN classifier using appropriate metrics.
   - Import `classification_report` and `confusion_matrix` from `sklearn.metrics`.
   - Print the **classification report** and **confusion matrix**. (If you want, use previous code to plot this confusion matrix)
   - Calculate and print the overall **accuracy**.

In [13]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

print("Classification Report:\n", classification_report(y_test, y_pred))

print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))

accuracy = accuracy_score(y_test, y_pred)
print(f"Overall Accuracy: {accuracy:.2f}")

Classification Report:
               precision    recall  f1-score   support

        Bird       0.73      0.79      0.76        42
      Insect       0.79      0.81      0.80        37
      Mammal       0.67      0.42      0.51        48
       Plant       0.75      1.00      0.86        36
     Reptile       0.87      0.92      0.89        37

    accuracy                           0.77       200
   macro avg       0.76      0.79      0.76       200
weighted avg       0.76      0.77      0.75       200

Confusion Matrix:
 [[33  1  7  0  1]
 [ 1 30  3  0  3]
 [11  4 20 12  1]
 [ 0  0  0 36  0]
 [ 0  3  0  0 34]]
Overall Accuracy: 0.77
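
The raw confusion matrix is easier to read with row and column labels, and the report's numbers can be recomputed from it directly. The sketch below copies the matrix from the output above. It relies on `confusion_matrix` ordering classes by sorted label, which matches the order of the classification report; `cm_df` and `accuracy_from_cm` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Class order follows the sorted labels, as in the classification report.
labels = ["Bird", "Insect", "Mammal", "Plant", "Reptile"]

# Matrix copied from the printed output: rows are actual, columns are predicted.
cm = np.array([
    [33, 1, 7, 0, 1],
    [1, 30, 3, 0, 3],
    [11, 4, 20, 12, 1],
    [0, 0, 0, 36, 0],
    [0, 3, 0, 0, 34],
])

cm_df = pd.DataFrame(
    cm,
    index=pd.Index(labels, name="Actual"),
    columns=pd.Index(labels, name="Predicted"),
)

# Recall: correct predictions divided by the row total (actual count).
recall = pd.Series(np.diag(cm) / cm.sum(axis=1), index=labels)
# Precision: correct predictions divided by the column total (predicted count).
precision = pd.Series(np.diag(cm) / cm.sum(axis=0), index=labels)
# Accuracy: diagonal total divided by all observations.
accuracy_from_cm = np.trace(cm) / cm.sum()

print(cm_df)
print(recall.round(2))
print(precision.round(2))
print(f"Accuracy: {accuracy_from_cm:.3f}")
```

Read this way, the low Mammal recall comes from the third row: of 48 actual mammals, 11 were predicted as Bird and 12 as Plant.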


**📝 Reflection Exercise**: Write a sentence or two of your overall
reflections on this practice. You may write whatever you want, but you
might perhaps respond to one or two of these questions:

-   Was anything unclear about this assignment?
    - No, nothing was unclear. Everything in this notebook was a topic we covered in class.
-   How hard was it for you? Where did you get “stuck”?
    - I don't think this notebook was harder than the last one, but it still made me think about how to approach the problem.
-   How long did it take you?
    - It took me about 45 minutes to 1 hour.
-   What questions or uncertainties remain?
    - I don't have any remaining questions or uncertainties.
-   What skills do you think you’ll need more practice with?
    - I think more practice in general would help me further apply and strengthen the skills we've learned.
-   Did you try anything out of curiosity that you weren’t specifically
    asked to do?
    - I did not try anything outside the assignment instructions this time.