
In [8]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np

# Make predictions on the preprocessed test set
y_pred = model.predict(X_test_processed)

# Evaluate the model
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

print(f"Mean Absolute Error (MAE): {mae:.2f}")
print(f"Mean Squared Error (MSE): {mse:.2f}")
print(f"Root Mean Squared Error (RMSE): {rmse:.2f}")
print(f"R-squared (R2): {r2:.2f}")

Mean Absolute Error (MAE): 50670.49
Mean Squared Error (MSE): 4908290571.35
Root Mean Squared Error (RMSE): 70059.19
R-squared (R2): 0.63


In [7]:
print("Missing values per column:")
print(df.isnull().sum())
print("\nTotal missing values in the dataset:")
print(df.isnull().sum().sum())

Missing values per column:
longitude               0
latitude                0
housing_median_age      0
total_rooms             0
total_bedrooms        207
population              0
households              0
median_income           0
median_house_value      0
ocean_proximity         0
dtype: int64

Total missing values in the dataset:
207
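
The 207 gaps in `total_bedrooms` are filled later by median imputation in the preprocessing cell. The sketch below uses a hypothetical toy column (not the housing data) to illustrate why the imputer is fitted on the training split only: the median is learned from training rows and then reused unchanged on the test rows.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy column standing in for a feature with gaps
train_col = np.array([[1.0], [2.0], [np.nan], [10.0]])
test_col = np.array([[np.nan], [3.0]])

imputer = SimpleImputer(strategy="median")

# fit_transform learns the median from the training rows only (median of 1, 2, 10 is 2)
train_filled = imputer.fit_transform(train_col)

# transform reuses the stored training median; the test rows do not influence it
test_filled = imputer.transform(test_col)

print("Learned median:", imputer.statistics_[0])
print("Filled test column:", test_filled.ravel())
```

Fitting on the training split alone keeps test-set statistics out of the preprocessing step, which is the same pattern as the `fit_transform` / `transform` pair used on `X_train` and `X_test`.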


# Task
Create a model by loading a dataset, preprocessing the data, defining the model architecture, training the model, and evaluating the model.

## Load the dataset

### Subtask:
Load the dataset into the environment. This might involve reading from a file (e.g., CSV, JSON, database) or loading an existing dataset from a library.


**Reasoning**:
Load the dataset into a pandas DataFrame from the specified CSV file and display its basic information to understand its structure and content.



In [1]:
import pandas as pd

# Load the dataset
df = pd.read_csv('/data/Housing.csv')

# Display the first few rows
print("First 5 rows of the DataFrame:")
print(df.head())

# Display basic information about the DataFrame
print("\nDataFrame Info:")
df.info()

# Display descriptive statistics of the DataFrame
print("\nDescriptive Statistics:")
print(df.describe())

FileNotFoundError: [Errno 2] No such file or directory: '/data/Housing.csv'

**Reasoning**:
The previous code failed with a `FileNotFoundError` because `/data/Housing.csv` does not exist in the environment. To fix this, the code now downloads the California housing dataset (`housing.csv`) from the `ageron/handson-ml2` GitHub repository, saves it to `/content/Housing.csv` in the Colab environment, and then loads it into a pandas DataFrame and displays its basic information.



In [2]:
import pandas as pd
import requests
import os

# Define the URL for the dataset and the local path to save it
housing_url = 'https://raw.githubusercontent.com/ageron/handson-ml2/master/datasets/housing/housing.csv'
local_file_path = '/content/Housing.csv'

# Check if the file already exists to avoid re-downloading
if not os.path.exists(local_file_path):
    print(f"Downloading Housing.csv from {housing_url} to {local_file_path}...")
    try:
        response = requests.get(housing_url, stream=True, timeout=30)  # timeout avoids hanging on a stalled connection
        response.raise_for_status() # Raise an exception for HTTP errors
        with open(local_file_path, 'wb') as f:
            for chunk in response.iter_content(chunk_size=8192):
                f.write(chunk)
        print("Download complete.")
    except requests.exceptions.RequestException as e:
        print(f"Error downloading file: {e}")
        # If download fails, the pd.read_csv will likely fail, which is handled below.
else:
    print(f"File already exists at {local_file_path}, skipping download.")

# Load the dataset
try:
    df = pd.read_csv(local_file_path)

    # Display the first few rows
    print("\nFirst 5 rows of the DataFrame:")
    print(df.head())

    # Display basic information about the DataFrame
    print("\nDataFrame Info:")
    df.info()

    # Display descriptive statistics of the DataFrame
    print("\nDescriptive Statistics:")
    print(df.describe())
except FileNotFoundError:
    print(f"Error: The file '{local_file_path}' was not found after attempting download. Please ensure the file is accessible or provide the correct path.")
except Exception as e:
    print(f"An unexpected error occurred while loading or processing the CSV: {e}")


Downloading Housing.csv from https://raw.githubusercontent.com/ageron/handson-ml2/master/datasets/housing/housing.csv to /content/Housing.csv...
Download complete.

First 5 rows of the DataFrame:
   longitude  latitude  housing_median_age  total_rooms  total_bedrooms  \
0    -122.23     37.88                41.0        880.0           129.0   
1    -122.22     37.86                21.0       7099.0          1106.0   
2    -122.24     37.85                52.0       1467.0           190.0   
3    -122.25     37.85                52.0       1274.0           235.0   
4    -122.25     37.85                52.0       1627.0           280.0   

   population  households  median_income  median_house_value ocean_proximity  
0       322.0       126.0         8.3252            452600.0        NEAR BAY  
1      2401.0      1138.0         8.3014            358500.0        NEAR BAY  
2       496.0       177.0         7.2574            352100.0        NEAR BAY  
3       558.0       219.0         5.6

## Preprocess the data

### Subtask:
Clean, transform, and prepare the data for model training. This includes handling missing values, encoding categorical features, scaling numerical features, and splitting the data into training and test sets.


**Reasoning**:
To preprocess the data, I will first separate the target variable from the features, then split the data into training and testing sets. After that, I will identify numerical and categorical columns and apply appropriate preprocessing steps using a ColumnTransformer, including median imputation and standardization for numerical features and one-hot encoding for categorical features.



In [3]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# 1. Separate the target variable from the features
X = df.drop('median_house_value', axis=1)
y = df['median_house_value']

# 2. Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Shape of X_train:", X_train.shape)
print("Shape of X_test:", X_test.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of y_test:", y_test.shape)

# 3. Identify numerical columns and 4. categorical column
# Exclude 'ocean_proximity' from numerical features
umerical_cols = X.select_dtypes(include=['float64', 'int64']).columns.drop('ocean_proximity', errors='ignore').tolist()
categorical_cols = ['ocean_proximity']

print("Numerical columns:", numerical_cols)
print("Categorical columns:", categorical_cols)

# 5. Create a ColumnTransformer for preprocessing

# Pipeline for numerical features: impute missing values with the median, then scale
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

# Pipeline for categorical features: one-hot encode
categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Create a column transformer to apply different transformations to different columns
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])

# 6. Apply the ColumnTransformer to both the training and test feature sets
X_train_processed = preprocessor.fit_transform(X_train)
X_test_processed = preprocessor.transform(X_test)

# Display shapes of processed data to confirm transformation
print("\nShape of processed X_train:", X_train_processed.shape)
print("Shape of processed X_test:", X_test_processed.shape)

Shape of X_train: (16512, 9)
Shape of X_test: (4128, 9)
Shape of y_train: (16512,)
Shape of y_test: (4128,)


NameError: name 'numerical_cols' is not defined

**Reasoning**:
The previous code failed because of a `NameError`: the variable `numerical_cols` was misspelled as `umerical_cols` during its definition. I will correct the spelling to `numerical_cols` to resolve this issue.



In [4]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# 1. Separate the target variable from the features
X = df.drop('median_house_value', axis=1)
y = df['median_house_value']

# 2. Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Shape of X_train:", X_train.shape)
print("Shape of X_test:", X_test.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of y_test:", y_test.shape)

# 3-4. Identify the numerical and categorical columns
# select_dtypes already skips the string column 'ocean_proximity'; the drop(..., errors='ignore') is a defensive no-op
numerical_cols = X.select_dtypes(include=['float64', 'int64']).columns.drop('ocean_proximity', errors='ignore').tolist()
categorical_cols = ['ocean_proximity']

print("Numerical columns:", numerical_cols)
print("Categorical columns:", categorical_cols)

# 5. Create a ColumnTransformer for preprocessing

# Pipeline for numerical features: impute missing values with the median, then scale
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

# Pipeline for categorical features: one-hot encode
categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Create a column transformer to apply different transformations to different columns
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])

# 6. Apply the ColumnTransformer to both the training and test feature sets
X_train_processed = preprocessor.fit_transform(X_train)
X_test_processed = preprocessor.transform(X_test)

# Display shapes of processed data to confirm transformation
print("\nShape of processed X_train:", X_train_processed.shape)
print("Shape of processed X_test:", X_test_processed.shape)

Shape of X_train: (16512, 9)
Shape of X_test: (4128, 9)
Shape of y_train: (16512,)
Shape of y_test: (4128,)
Numerical columns: ['longitude', 'latitude', 'housing_median_age', 'total_rooms', 'total_bedrooms', 'population', 'households', 'median_income']
Categorical columns: ['ocean_proximity']

Shape of processed X_train: (16512, 13)
Shape of processed X_test: (4128, 13)


## Define the model architecture

### Subtask:
Specify the type of model to be used (e.g., linear regression, neural network, decision tree) and its specific architecture (e.g., number of layers, neurons, activation functions).


**Reasoning**:
The subtask is to define the model architecture. The model is scikit-learn's `LinearRegression` with default settings: I will import it, instantiate it, and print it to confirm its configuration.



In [5]:
from sklearn.linear_model import LinearRegression

# Instantiate a LinearRegression model
model = LinearRegression()

# Print the model; scikit-learn's repr shows only non-default parameters, so a default model prints as LinearRegression()
print(model)

LinearRegression()


## Train the model

### Subtask:
Train the defined model using the preprocessed training data. This involves fitting the model to the data, optimizing its parameters using a chosen algorithm (e.g., gradient descent), and monitoring its performance.


**Reasoning**:
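I will train the previously instantiated `LinearRegression` model on the preprocessed training features and the training target variable. `LinearRegression` is fitted by ordinary least squares in a single `fit` call, so there is no iterative optimizer or training curve to monitor.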



In [6]:
print("Training the Linear Regression model...")
model.fit(X_train_processed, y_train)
print("Model training complete.")

Training the Linear Regression model...
Model training complete.
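
Although the subtask text mentions gradient descent as an example, `LinearRegression` solves an ordinary least-squares problem directly. The sketch below, on hypothetical synthetic data (not the housing set), checks that its coefficients agree with a plain NumPy least-squares solve; all variable names here are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical synthetic regression problem: 50 samples, 3 features
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 3))
true_coef = np.array([2.0, -1.0, 0.5])
y_toy = X_toy @ true_coef + 4.0 + rng.normal(scale=0.1, size=50)

# Fit with scikit-learn
sk_model = LinearRegression().fit(X_toy, y_toy)

# Same problem solved directly: append a column of ones for the intercept
X_design = np.column_stack([X_toy, np.ones(len(X_toy))])
solution, *_ = np.linalg.lstsq(X_design, y_toy, rcond=None)

# Both approaches should agree up to floating-point precision
print("Coefficients match:", np.allclose(sk_model.coef_, solution[:3]))
print("Intercept matches:", np.isclose(sk_model.intercept_, solution[3]))
```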


## Evaluate the model

### Subtask:
Assess the trained model's performance using appropriate evaluation metrics on the test dataset. This step helps determine how well the model generalizes to unseen data.
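
The metrics computed in the evaluation cell (`In [8]`) can be reproduced by hand. The sketch below uses four hypothetical target values and predictions (not taken from the housing data) to show what each metric measures: MAE and RMSE are in the units of the target, while R2 compares the model's squared error with that of always predicting the mean.

```python
import numpy as np

# Hypothetical true values and predictions, chosen for easy arithmetic
y_true = np.array([100.0, 150.0, 200.0, 250.0])
y_hat = np.array([110.0, 140.0, 190.0, 280.0])

errors = y_true - y_hat                      # [-10, 10, 10, -30]

mae_toy = np.mean(np.abs(errors))            # average absolute error
mse_toy = np.mean(errors ** 2)               # average squared error
rmse_toy = np.sqrt(mse_toy)                  # back in the units of the target

# R2 = 1 - (residual sum of squares) / (total sum of squares around the mean)
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_toy = 1.0 - ss_res / ss_tot

print(f"MAE:  {mae_toy:.2f}")    # 15.00
print(f"MSE:  {mse_toy:.2f}")    # 300.00
print(f"RMSE: {rmse_toy:.2f}")   # 17.32
print(f"R2:   {r2_toy:.2f}")     # 0.90
```

Read the same way, the reported R2 of 0.63 means the linear model accounts for roughly 63% of the variance of `median_house_value` on the test split, and the RMSE of 70059.19 is expressed in the same units as that target.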


## Summary:

### Data Analysis Key Findings

*   The `Housing.csv` dataset, containing 20640 entries and 10 columns, was successfully loaded.
*   Initial data inspection revealed missing values in the `total_bedrooms` column (20433 non-null entries out of 20640).
*   The data was successfully preprocessed:
    *   The target variable, `median_house_value`, was separated from the features.
    *   The dataset was split into training (16512 samples) and testing (4128 samples) sets.
    *   Missing numerical values were imputed using the median strategy, and numerical features were scaled using `StandardScaler`.
    *   The categorical `ocean_proximity` feature was one-hot encoded, giving processed feature sets with 13 columns: 8 numerical plus 5 one-hot columns (`X_train_processed` shape: (16512, 13)).
*   A `LinearRegression` model was defined and successfully trained on the preprocessed training data.

### Insights or Next Steps

*   On the unseen test set (cell `In [8]`), the trained `LinearRegression` model reaches an MAE of 50670.49, an RMSE of 70059.19, and an R2 of 0.63, so it explains roughly 63% of the variance in `median_house_value`.
*   If these results are not satisfactory, consider other regression models or regularized linear variants with a tunable penalty; plain `LinearRegression` itself has no regularization hyperparameter to tune.
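
One way to compare candidate models is k-fold cross-validation on the training data. The sketch below is a minimal illustration on hypothetical synthetic data from `make_regression` (not the housing set); the candidate names, the `Ridge` penalty `alpha=1.0`, and the choice of 5 folds are assumptions for the example, not settings taken from this notebook.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic data standing in for the preprocessed training set
X_syn, y_syn = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=42)

# Candidate models; each pipeline scales features inside every CV fold
candidates = {
    "linear": make_pipeline(StandardScaler(), LinearRegression()),
    "ridge": make_pipeline(StandardScaler(), Ridge(alpha=1.0)),
}

cv_rmse = {}
for name, estimator in candidates.items():
    # scikit-learn maximizes scores, so RMSE is reported negated
    scores = cross_val_score(
        estimator, X_syn, y_syn, cv=5, scoring="neg_root_mean_squared_error"
    )
    cv_rmse[name] = -scores.mean()
    print(f"{name}: mean CV RMSE = {cv_rmse[name]:.2f}")
```

Keeping the scaler inside the pipeline means each fold is standardized using only its own training portion, which mirrors fitting the `preprocessor` on `X_train` alone.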
