### Subtask 1: Load the dataset
- Load the dataset from the specified CSV file located at 'F:\ITShoulders\AI_Data_Science_agent\temp_uploads\housing.csv' into a pandas DataFrame.


In [1]:
# Load the dataset into a pandas DataFrame
import pandas as pd

file_path = r'F:\ITShoulders\AI_Data_Science_agent\temp_uploads\housing.csv'
housing_df = pd.read_csv(file_path)
print(housing_df.head())


   longitude  latitude  housing_median_age  total_rooms  total_bedrooms  \
0    -122.23     37.88                41.0        880.0           129.0   
1    -122.22     37.86                21.0       7099.0          1106.0   
2    -122.24     37.85                52.0       1467.0           190.0   
3    -122.25     37.85                52.0       1274.0           235.0   
4    -122.25     37.85                52.0       1627.0           280.0   

   population  households  median_income  median_house_value ocean_proximity  
0       322.0       126.0         8.3252            452600.0        NEAR BAY  
1      2401.0      1138.0         8.3014            358500.0        NEAR BAY  
2       496.0       177.0         7.2574            352100.0        NEAR BAY  
3       558.0       219.0         5.6431            341300.0        NEAR BAY  
4       565.0       259.0         3.8462            342200.0        NEAR BAY  


### Subtask 2: Initial Data Exploration
- Perform initial data exploration by viewing the first few rows of the DataFrame, checking the data types of each column, and identifying any missing values.


In [2]:
# Initial data exploration
# Checking data types and missing values in the DataFrame
print(housing_df.info())
print('\nMissing values in each column:')
print(housing_df.isnull().sum())


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           20640 non-null  float64
 1   latitude            20640 non-null  float64
 2   housing_median_age  20640 non-null  float64
 3   total_rooms         20640 non-null  float64
 4   total_bedrooms      20433 non-null  float64
 5   population          20640 non-null  float64
 6   households          20640 non-null  float64
 7   median_income       20640 non-null  float64
 8   median_house_value  20640 non-null  float64
 9   ocean_proximity     20640 non-null  object 
dtypes: float64(9), object(1)
memory usage: 1.6+ MB
None

Missing values in each column:
longitude               0
latitude                0
housing_median_age      0
total_rooms             0
total_bedrooms        207
population              0
households              0
median_income           0
median_house_value      0
ocean_proximity         0
dtype: int64
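
The 207 missing `total_bedrooms` values are about 1% of the 20,640 rows (207 / 20640 ≈ 0.010). As a rough sketch, the share of missing entries per column could be computed as shown below. `toy_df` is a hypothetical stand-in for `housing_df`, with made-up missingness.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for housing_df
toy_df = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, 235.0],
})

# isnull().mean() gives the fraction of missing entries per column
missing_share = toy_df.isnull().mean()
print((missing_share * 100).round(2))
```

A share this small is one reason a simple strategy such as median imputation can be a reasonable choice.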

### Subtask 3: Exploratory Data Analysis (EDA)
- Conduct exploratory data analysis (EDA) to derive insights, including generating statistical summaries, visualizing the distribution of key features like house prices, and analyzing correlations between features.


In [3]:
# Exploratory Data Analysis
import matplotlib.pyplot as plt
import seaborn as sns

# Generating statistical summaries
stats_summary = housing_df.describe()
print(stats_summary)

# Visualizing the distribution of median house value
plt.figure(figsize=(10, 6))
sns.histplot(housing_df['median_house_value'], bins=30, kde=True, color='blue')
plt.title('Distribution of Median House Value')
plt.xlabel('Median House Value')
plt.ylabel('Frequency')
plt.show()

# Analyzing correlations between features
plt.figure(figsize=(12, 8))
correlation_matrix = housing_df.corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix of Housing Features')
plt.show()


          longitude      latitude  housing_median_age   total_rooms  \
count  20640.000000  20640.000000        20640.000000  20640.000000   
mean    -119.569704     35.631861           28.639486   2635.763081   
std        2.003532      2.135952           12.585558   2181.615252   
min     -124.350000     32.540000            1.000000      2.000000   
25%     -121.800000     33.930000           18.000000   1447.750000   
50%     -118.490000     34.260000           29.000000   2127.000000   
75%     -118.010000     37.710000           37.000000   3148.000000   
max     -114.310000     41.950000           52.000000  39320.000000   

       total_bedrooms    population    households  median_income  \
count    20433.000000  20640.000000  20640.000000   20640.000000   
mean       537.870553   1425.476744    499.539680       3.870671   
std        421.385070   1132.462122    382.329753       1.899822   
min          1.000000      3.000000      1.000000       0.499900   
25%        296.00000

[Figure: histogram with KDE, "Distribution of Median House Value"]


ValueError: could not convert string to float: 'NEAR BAY'

### Handling error
- The error occurred because `housing_df.corr()` tried to convert every column to float, and the 'ocean_proximity' column holds strings (object dtype) such as 'NEAR BAY'.
- Because 'ocean_proximity' is a categorical feature, we exclude it when computing the correlation matrix.


In [4]:
# Corrected Correlation Matrix Excluding 'ocean_proximity'
plt.figure(figsize=(12, 8))
correlation_matrix = housing_df.drop(columns=['ocean_proximity']).corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix of Numeric Housing Features')
plt.show()


[Figure: heatmap, "Correlation Matrix of Numeric Housing Features"]
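
An alternative to dropping the column by name is to let pandas select the numeric columns itself, which keeps working if more text columns are added later. The sketch below uses a hypothetical `toy_df` in place of `housing_df`. The `numeric_only` argument of `DataFrame.corr` assumes pandas 1.5 or later.

```python
import pandas as pd

# Hypothetical toy frame standing in for housing_df
toy_df = pd.DataFrame({
    "median_income": [8.3, 8.3, 7.3, 5.6, 3.8],
    "median_house_value": [452600.0, 358500.0, 352100.0, 341300.0, 342200.0],
    "ocean_proximity": ["NEAR BAY"] * 5,
})

# Option 1: keep only numeric columns explicitly
corr_a = toy_df.select_dtypes(include="number").corr()

# Option 2: ask corr() to skip non-numeric columns (pandas >= 1.5)
corr_b = toy_df.corr(numeric_only=True)

print(corr_a.round(3))
```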


### Subtask 4: Data Preprocessing
- Preprocess the data by imputing missing values, one-hot encoding the categorical column, and standardizing the numerical features, so that the dataset is ready for model training.


In [5]:
# Data Preprocessing
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Separate target variable from features
X = housing_df.drop('median_house_value', axis=1)
y = housing_df['median_house_value']

# Identify numerical and categorical columns
numerical_cols = X.select_dtypes(include=['float64']).columns
categorical_cols = X.select_dtypes(include=['object']).columns

# Preprocessing pipeline for numerical data
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

# Preprocessing pipeline for categorical data
categorical_transformer = Pipeline(steps=[
    ('encoder', OneHotEncoder(sparse_output=False, handle_unknown='ignore'))
])

# Combine numerical and categorical transformations
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols),
    ])

# Preprocess the data
X_processed = preprocessor.fit_transform(X)
print('Data preprocessing completed.')
print(f'Processed feature shape: {X_processed.shape}')


Data preprocessing completed.
Processed feature shape: (20640, 13)


### Subtask 5: Split Dataset
- Split the dataset into training and testing subsets to facilitate model validation and performance evaluation.


In [6]:
# Split the data into training and testing subsets
X_train, X_test, y_train, y_test = train_test_split(X_processed, y, test_size=0.2, random_state=42)
print('Data split completed.')
print(f'Training data shape: {X_train.shape}, {y_train.shape}')
print(f'Testing data shape: {X_test.shape}, {y_test.shape}')


Data split completed.
Training data shape: (16512, 13), (16512,)
Testing data shape: (4128, 13), (4128,)
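
Note that the preprocessor above was fitted on all 20,640 rows before the split, so the imputer medians and scaler statistics also reflect the test rows. A common way to avoid this kind of leakage is to split first and fit the preprocessing on the training portion only. The following is a minimal sketch with synthetic data. `X_demo`, `y_demo`, and the model choice are illustrative assumptions, and results on the housing data would differ from those reported here.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the housing dataset)
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# 1. Split the raw features first
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# 2. Fit preprocessing and model together on the training rows only
demo_pipeline = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("model", LinearRegression()),
])
demo_pipeline.fit(X_tr, y_tr)

# 3. The test rows are only transformed and scored, never used for fitting
print(f"Test R-squared: {demo_pipeline.score(X_te, y_te):.3f}")
```

Wrapping the steps in one `Pipeline` means the scaler statistics come from `X_tr` alone, and the same object can later be applied to unseen rows.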


### Subtask 6: Train Multiple Predictive Models
- Train several predictive models for house prices: linear regression, a decision tree, and a random forest.


In [7]:
# Train multiple predictive models
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

# Instantiate the models
linear_model = LinearRegression()
decision_tree_model = DecisionTreeRegressor(random_state=42)
random_forest_model = RandomForestRegressor(random_state=42)

# Train the models
linear_model.fit(X_train, y_train)
decision_tree_model.fit(X_train, y_train)
random_forest_model.fit(X_train, y_train)

print('Model training completed.')


Model training completed.


### Subtask 7: Evaluate and Compare Models
- Evaluate and compare the models' performance on the test set using Mean Absolute Error (MAE) and the R-squared score to determine which model predicts most accurately.


In [8]:
# Evaluate and compare models
from sklearn.metrics import mean_absolute_error, r2_score

# Function to evaluate model performance
def evaluate_model(model, X_test, y_test):
    predictions = model.predict(X_test)
    mae = mean_absolute_error(y_test, predictions)
    r2 = r2_score(y_test, predictions)
    return mae, r2

# Evaluate Linear Regression model
linear_mae, linear_r2 = evaluate_model(linear_model, X_test, y_test)

# Evaluate Decision Tree model
decision_tree_mae, decision_tree_r2 = evaluate_model(decision_tree_model, X_test, y_test)

# Evaluate Random Forest model
random_forest_mae, random_forest_r2 = evaluate_model(random_forest_model, X_test, y_test)

# Print evaluation results
print(f"Linear Regression - MAE: {linear_mae}, R-squared: {linear_r2}")
print(f"Decision Tree - MAE: {decision_tree_mae}, R-squared: {decision_tree_r2}")
print(f"Random Forest - MAE: {random_forest_mae}, R-squared: {random_forest_r2}")


Linear Regression - MAE: 50670.73824097192, R-squared: 0.6254240620553605
Decision Tree - MAE: 43600.67151162791, R-squared: 0.6354554029044672
Random Forest - MAE: 31643.65566860465, R-squared: 0.8169555593071559
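
To make the comparison easier to read, the scores printed above can be collected in a small table and ranked. The sketch below re-enters the printed numbers (rounded) by hand. The `results` frame is an illustrative helper, not part of the original run.

```python
import pandas as pd

# Scores copied (rounded) from the evaluation output above
results = pd.DataFrame(
    {
        "MAE": [50670.74, 43600.67, 31643.66],
        "R2": [0.6254, 0.6355, 0.8170],
    },
    index=["Linear Regression", "Decision Tree", "Random Forest"],
)

# Lower MAE is better, so sort ascending
ranked = results.sort_values("MAE")
print(ranked)

# MAE reduction of the best model relative to linear regression
best_name = ranked.index[0]
reduction = 1 - ranked.loc[best_name, "MAE"] / results.loc["Linear Regression", "MAE"]
print(f"{best_name} lowers MAE by {reduction:.1%} compared with Linear Regression")
```

Both metrics point the same way here: the model with the lowest MAE also has the highest R-squared.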


### Subtask 8: Identify Best-Performing Model
- Based on the evaluation, identify and suggest the best-performing model for predicting house prices.


In [9]:
# Identify and suggest the best-performing model
# Based on MAE and R-squared scores from previous evaluations, the Random Forest model performs the best.
best_model = random_forest_model

print("The Random Forest model is identified as the best-performing model based on evaluation metrics.")


The Random Forest model is identified as the best-performing model based on evaluation metrics.


### Subtask 9: Develop Production-Ready Inference Function
- Develop a production-ready function that uses the best model to produce house price predictions, incorporating any necessary data preprocessing within the function.


In [10]:
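# Inference function for the best model
def predict_house_price(input_data):
    """Return the predicted median house value for one input record.

    input_data: dict mapping each feature column name to its value.
    Relies on the fitted `preprocessor` and `best_model` defined above.
    """
    # Convert the single record to a one-row DataFrame
    input_df = pd.DataFrame([input_data])

    # Preprocess the input data using the fitted preprocessor
    processed_input = preprocessor.transform(input_df)

    # Predict using the best model
    prediction = best_model.predict(processed_input)

    return prediction[0]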

# Example usage of the inference function
sample_input = {
    'longitude': -122.23,
    'latitude': 37.88,
    'housing_median_age': 41.0,
    'total_rooms': 880.0,
    'total_bedrooms': 129.0,
    'population': 322.0,
    'households': 126.0,
    'median_income': 8.3252,
    'ocean_proximity': 'NEAR BAY'
}

# Predicting house price for sample input
predicted_price = predict_house_price(sample_input)
print(f'Predicted House Price: {predicted_price}')


Predicted House Price: 431942.36


### Conclusion
- Built an end-to-end workflow for house price prediction and compared linear regression, decision tree, and random forest models.
- Identified Random Forest as the best-performing model on the test set (MAE ≈ 31,644; R-squared ≈ 0.817).
- Created an inference function that applies the fitted preprocessor and the Random Forest model to predict house prices for new data.
