<a href="https://colab.research.google.com/github/vleon777/Python_Basics/blob/main/week3_first_project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Week 3 – First Practical Project: Simple Prediction with scikit-learn

In this project you will build a **basic machine learning model** using the library **scikit-learn**.  
We will work with a small dataset (e.g., housing prices or sales).  

Libraries used:

- `pandas`: manages the dataset as tables.
- `numpy`: handles the math under the hood.
- `scikit-learn`: provides the machine learning models.

---

## Part 1: Setup
1. Import the libraries you need (`pandas`, `numpy`, `scikit-learn`).
2. Load a dataset:
   - Option A: Use `sklearn.datasets.load_diabetes()` (built-in dataset, easy to start).
   - Option B: Load a CSV file from an online source (like housing data).
3. Inspect the dataset: print the first 5 rows, check column names, and understand the variables.
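
As a minimal sketch of Option A, the snippet below loads the built-in diabetes dataset and inspects it. The variable names `diabetes` and `diabetes_df` are illustrative choices, not part of the assignment.

```python
from sklearn.datasets import load_diabetes

# as_frame=True returns pandas objects instead of NumPy arrays
diabetes = load_diabetes(as_frame=True)
diabetes_df = diabetes.frame  # feature columns plus a "target" column

print(diabetes_df.head(5))            # first 5 rows
print(diabetes_df.columns.tolist())   # column names
print(diabetes_df.shape)              # (number of rows, number of columns)
```

The same three inspection calls work on any DataFrame, including one loaded from a CSV file for Option B.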

---
## Part 2: Data Preparation
1. Choose one column as your **feature** (input, X) and one column as your **target** (output, y).
   - Example: predict house price (`y`) based on number of rooms (`X`).
2. Split your data into **training set** and **test set** (use `train_test_split` from sklearn).
3. Print the shapes of train and test sets.
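
The steps above could look roughly like this on the built-in diabetes dataset. The choice of `bmi` as the feature is only an assumption for illustration. Note the double brackets: scikit-learn expects `X` to be two-dimensional, while `y` can stay one-dimensional.

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

df = load_diabetes(as_frame=True).frame

X = df[["bmi"]]    # double brackets keep X as a 2-D DataFrame
y = df["target"]   # single brackets give a 1-D Series

# Hold out 20% of the rows for testing; random_state makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)
```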

---

## Part 3: Build the Model
1. Import a model, for example `LinearRegression` from sklearn.
2. Create the model object.
3. Fit (train) the model with your training data.
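
To see what fitting actually produces, here is a small sketch on made-up data that lies exactly on the line y = 3x + 2. The names `toy_X`, `toy_y`, and `toy_model` are hypothetical. After fitting, the learned slope and intercept are stored on the model object.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data on the line y = 3x + 2 (no noise)
toy_X = np.arange(10, dtype=float).reshape(-1, 1)  # shape (10, 1): 2-D as required
toy_y = 3.0 * toy_X.ravel() + 2.0

toy_model = LinearRegression()   # create the model object
toy_model.fit(toy_X, toy_y)      # learn slope and intercept from the data

slope = toy_model.coef_[0]
intercept = toy_model.intercept_
print("slope:", slope)
print("intercept:", intercept)
```

With real, noisy data the fitted line will not pass through every point, but `coef_` and `intercept_` are read the same way.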

---
## Part 4: Make Predictions
1. Use the trained model to predict values for your test set.
2. Print the first 10 predictions vs. the true values.
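
One possible way to compare predictions with true values is to place them side by side in a small DataFrame. This sketch again assumes the diabetes dataset with `bmi` as the only feature. The `comparison` table and its column names are illustrative.

```python
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

df = load_diabetes(as_frame=True).frame
X_train, X_test, y_train, y_test = train_test_split(
    df[["bmi"]], df["target"], test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)   # NumPy array, one prediction per test row

# First 10 true values next to the first 10 predictions
comparison = pd.DataFrame({
    "actual": y_test.to_numpy()[:10],
    "predicted": y_pred[:10],
})
comparison["error"] = comparison["predicted"] - comparison["actual"]
print(comparison.round(2))
```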

---
## Part 5: Evaluate the Model
1. Import an evaluation metric, for example `mean_squared_error` (lower is better) or `r2_score` (closer to 1 is better).
2. Calculate the metric on the test set and print it.
3. Write a short explanation: is the error high or low?
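
To make the two metrics less of a black box, the sketch below computes them by hand with NumPy on four made-up values and checks the result against scikit-learn. The arrays `y_true` and `y_hat` are invented for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up true values and predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 10.0])

# MSE: average of the squared differences
mse_manual = np.mean((y_true - y_hat) ** 2)

# R²: 1 - (squared error of the model) / (squared error of predicting the mean)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print("MSE:", mse_manual, "vs sklearn:", mean_squared_error(y_true, y_hat))
print("R²:", r2_manual, "vs sklearn:", r2_score(y_true, y_hat))
```

An R² near 1 means the predictions track the true values closely. An R² near 0 means the model does about as well as always predicting the mean.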

---

## Part 6: Deliverable
- Save this notebook as `week3_first_project.ipynb`.
- Upload it to your GitHub repository.
- Add a short README in GitHub explaining what the project does.


In [None]:
# Import the libraries needed for the project
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.datasets import fetch_california_housing

# Load the dataset
housing = fetch_california_housing(as_frame=True)
df = housing.frame   # Features and target together in one pandas DataFrame

# Look at the first rows
df.head(5)

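|   | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | MedHouseVal |
|---|--------|----------|----------|-----------|------------|----------|----------|-----------|-------------|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.02381 | 322.0 | 2.555556 | 37.88 | -122.23 | 4.526 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.97188 | 2401.0 | 2.109842 | 37.86 | -122.22 | 3.585 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.80226 | 37.85 | -122.24 | 3.521 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 | 3.413 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 | 3.422 |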


In [None]:
# Choose features (X) and target (y)
X = df[['AveRooms']] # Using AveRooms as the feature
y = df['MedHouseVal'] # Using MedHouseVal as the target

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Print the shapes of the train and test sets
print("Shape of X_train:", X_train.shape)
print("Shape of X_test:", X_test.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of y_test:", y_test.shape)


Shape of X_train: (16512, 1)
Shape of X_test: (4128, 1)
Shape of y_train: (16512,)
Shape of y_test: (4128,)


In [None]:
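# Create the model and fit (train) it on the training data
model = LinearRegression()
model.fit(X_train, y_train)

# Predict house values for the test set
y_pred = model.predict(X_test)

# Compare the first 10 predictions with the true values
print("Predictions:", y_pred[:10])
print("Actual values:", y_test[:10].values)

# Evaluate the model on the test set
# Mean Squared Error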
mse = mean_squared_error(y_test, y_pred)

# R² Score
r2 = r2_score(y_test, y_pred)

print("Mean Squared Error:", mse)
print("R² Score:", r2)

Predictions: [1.97653709 2.04156313 1.96003112 2.12785581 2.07638001 2.05784293
 2.13231401 2.03568685 1.95600167 2.18276827]
Actual values: [0.477   0.458   5.00001 2.186   2.78    1.587   1.982   1.575   3.4
 4.466  ]
Mean Squared Error: 1.2923314440807299
R² Score: 0.013795337532284901


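## Error Metrics

Mean Squared Error (MSE): 1.29
Taking the square root gives a root mean squared error (RMSE) of √1.29 ≈ 1.14, so a typical prediction is off by about 1.14 units. `MedHouseVal` is expressed in units of $100,000, which makes that roughly $114,000.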
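
The conversion from MSE to an error in the target's own units is a one-line calculation. This sketch reuses the MSE value printed by the notebook. The variable names are illustrative.

```python
import math

mse = 1.2923314440807299   # value printed by the notebook above
rmse = math.sqrt(mse)      # back in the units of MedHouseVal ($100,000s)

print("RMSE:", round(rmse, 3))
print("RMSE in dollars:", round(rmse * 100_000))
```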

R² Score: 0.014
This is very low: the model explains only about 1.4% of the variance in the target. In practice it is barely better than always predicting the mean house value.
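
"Predicting the average" can be made concrete with scikit-learn's `DummyRegressor`, which ignores the features entirely. The sketch below runs on the built-in diabetes dataset rather than the housing data, purely as an assumption so that it needs no download. The `bmi` feature and the variable names are illustrative.

```python
from sklearn.datasets import load_diabetes
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

data = load_diabetes(as_frame=True).frame
X_tr, X_te, y_tr, y_te = train_test_split(
    data[["bmi"]], data["target"], test_size=0.2, random_state=42
)

# Baseline: ignores X and always predicts the mean of the training targets
baseline = DummyRegressor(strategy="mean").fit(X_tr, y_tr)
baseline_r2 = r2_score(y_te, baseline.predict(X_te))

# One-feature linear regression on the same split
linreg = LinearRegression().fit(X_tr, y_tr)
linreg_r2 = r2_score(y_te, linreg.predict(X_te))

print("Baseline R²:", round(baseline_r2, 3))
print("Linear regression R²:", round(linreg_r2, 3))
```

The baseline's test-set R² is at most 0, so any model whose R² sits close to 0 is adding very little on top of the mean.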

### What does this mean?

The model is underfitting: it is too simple to capture the structure in the data.

A linear regression on a single feature (`AveRooms`) is not enough to explain house values.

A natural next step is to train on more of the available columns rather than just one.
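
As a rough illustration of that suggestion, the sketch below compares a one-feature model with an all-features model on the same split. It uses the built-in diabetes dataset as a stand-in, which is an assumption made so the example runs without a download. The helper `holdout_r2` is hypothetical.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

frame = load_diabetes(as_frame=True).frame
all_features = [col for col in frame.columns if col != "target"]

def holdout_r2(columns):
    """Fit a linear regression on the given columns and return R² on the test set."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        frame[columns], frame["target"], test_size=0.2, random_state=42
    )
    fitted = LinearRegression().fit(X_tr, y_tr)
    return r2_score(y_te, fitted.predict(X_te))

r2_one = holdout_r2(["bmi"])        # single feature
r2_all = holdout_r2(all_features)   # every feature column

print("R² with one feature:", round(r2_one, 3))
print("R² with all features:", round(r2_all, 3))
```

In this notebook the equivalent change would be something like `X = df.drop(columns="MedHouseVal")` before the split. Whether and how much the score improves on the housing data has to be checked by running it.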
