## Real Estate Data Analysis - Final Project

### Exploratory Data Analysis

##### 1. Loading the Data
##### 2. Understanding the Data
##### 3. Identify Numerical and Categorical Columns
##### 4. Check for Missing Values and Duplicates
##### 5. Perform Descriptive Statistics

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVR
from sklearn.neural_network import MLPRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import GridSearchCV
import joblib

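# Load the dataset
# The Google Drive "view" link returns an HTML page rather than the CSV, so build a
# direct-download URL from the file ID (assumes the file is shared publicly)
file_id = '1M0eWey4zld4TbV3xiv2nqszbgvELtKkO'
url = f'https://drive.google.com/uc?export=download&id={file_id}'
data = pd.read_csv(url)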

# Show first few rows
print(data.head())

# Show data info
print(data.info())

# Check the basic statistics
print(data.describe())

# Identify numerical and categorical columns
numerical_cols = data.select_dtypes(include='number').columns
categorical_cols = data.select_dtypes(include=['object']).columns

print("Numerical Columns: ", numerical_cols)
print("Categorical Columns: ", categorical_cols)

# Check for missing values
missing_values = data.isnull().sum()
print("Missing Values:\n", missing_values)

# Check for duplicates
duplicates = data.duplicated().sum()
print(f"\nNumber of duplicate rows: {duplicates}")

# Descriptive statistics for numerical columns
print(data[numerical_cols].describe())
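
The cell above only reports missing values and duplicates. A minimal sketch of one way to act on them follows, using a toy frame with hypothetical columns and values in place of the real dataset. Median and mode imputation are assumptions here, not choices made in the original analysis.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (hypothetical columns and values).
toy = pd.DataFrame({
    "landvalue": [100.0, 250.0, np.nan, 250.0, 400.0],
    "zone": ["A", "B", "B", "B", None],
})

# Share of missing values per column, as a percentage.
missing_pct = toy.isnull().mean() * 100
print(missing_pct)

# Drop exact duplicate rows, then fill gaps: median for numbers, mode for text.
cleaned = toy.drop_duplicates().copy()
cleaned["landvalue"] = cleaned["landvalue"].fillna(cleaned["landvalue"].median())
cleaned["zone"] = cleaned["zone"].fillna(cleaned["zone"].mode()[0])
print(cleaned)
```

Whether to impute or drop rows depends on how much data is missing; the percentages printed first are meant to inform that decision.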

#### 6. Visualize Data

##### 6.1 Histograms for Numerical Data

In [None]:

data[numerical_cols].hist(bins=15, figsize=(15, 10))
plt.suptitle("Histograms for Numerical Columns")
plt.show()

##### 6.2 Boxplots for Outliers

In [None]:
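# Size the grid from the number of numerical columns (3 plots per row),
# so the loop does not fail when there are more than 9 columns
n_plot_cols = 3
n_plot_rows = max(1, -(-len(numerical_cols) // n_plot_cols))  # ceiling division
plt.figure(figsize=(15, 4 * n_plot_rows))
for i, col in enumerate(numerical_cols):
    plt.subplot(n_plot_rows, n_plot_cols, i + 1)
    sns.boxplot(x=data[col])
    plt.title(f'Boxplot of {col}')
plt.tight_layout()
plt.show()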
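
Boxplots draw their whiskers at 1.5 times the interquartile range (IQR) by default, so the points shown as outliers can also be counted numerically. The sketch below applies that rule to a toy series; the helper name `iqr_outlier_mask` and the sample values are illustrative assumptions.

```python
import pandas as pd

def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True for values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Toy values with one obvious extreme point.
values = pd.Series([10, 12, 11, 13, 12, 11, 95])
mask = iqr_outlier_mask(values)
print("Outliers:", values[mask].tolist())
```

Applied per column, for example `iqr_outlier_mask(data[col]).sum()`, this gives an outlier count to read alongside each boxplot.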

##### 6.3 Scatterplot for Correlation

In [None]:
plt.figure(figsize=(8, 6))
sns.scatterplot(data=data, x='landvalue', y='other_column_name')  # Replace with another column
plt.title('Scatter plot between landvalue and another column')
plt.show()

#### 7. Check Correlations

In [None]:
# Restrict to numerical columns; text columns cannot be correlated directly
corr = data[numerical_cols].corr()
plt.figure(figsize=(12, 8))
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()

#### 8. Check Skewness and Kurtosis

In [None]:
for col in numerical_cols:
    print(f"{col} - Skewness: {data[col].skew()}, Kurtosis: {data[col].kurt()}")
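
A strongly right-skewed column is often log-transformed before modeling. The following sketch, on made-up values, shows how `np.log1p` reduces skewness. Whether any column in this dataset needs the transform is not established here; it is only one common option.

```python
import numpy as np
import pandas as pd

# Made-up right-skewed values (a few very large observations).
skewed = pd.Series([1, 2, 2, 3, 3, 3, 4, 5, 50, 200], dtype=float)

# log1p computes log(1 + x), which is safe for zeros and compresses large values.
transformed = np.log1p(skewed)

print("Skewness before:", skewed.skew())
print("Skewness after: ", transformed.skew())
```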

#### 9. Key Observations
Land Value Correlation: check how the target variable `landvalue` correlates with the other numerical columns:

In [None]:
print(data[numerical_cols].corr()['landvalue'].sort_values(ascending=False))

### Steps for Feature Engineering:
- Identify Categorical Columns: First, we need to identify which columns are categorical.
- Handle Categorical Columns: Depending on the nature of the categorical data (whether they are ordinal or nominal), we can apply Label Encoding or One-Hot Encoding.

In [None]:
# Identify categorical columns
categorical_cols = data.select_dtypes(include=['object']).columns
print("Categorical Columns:", categorical_cols)

# Option 1: One-Hot Encoding (suited to nominal categories), kept in a separate frame
data_one_hot = pd.get_dummies(data, columns=categorical_cols, drop_first=True)
print("Data after One-Hot Encoding:")
print(data_one_hot.head())

# Option 2: Label Encoding, applied in place; this is the frame used in the following steps
# Keep one fitted encoder per column so the same mapping can be reused on new data
label_encoders = {}
for col in categorical_cols:
    label_encoders[col] = LabelEncoder()
    data[col] = label_encoders[col].fit_transform(data[col].astype(str))

print("Data after Label Encoding:")
print(data.head())
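
The ordinal versus nominal distinction mentioned in the steps can be sketched as follows. The column names, categories, and their ordering are hypothetical: an ordinal column gets an explicit order-preserving mapping, while a nominal column is one-hot encoded so that no artificial order is implied.

```python
import pandas as pd

# Hypothetical columns: 'condition' has a natural order, 'zone' does not.
toy = pd.DataFrame({
    "condition": ["poor", "good", "fair", "good"],
    "zone": ["north", "south", "east", "north"],
})

# Ordinal: map categories to integers that respect their order.
condition_order = {"poor": 0, "fair": 1, "good": 2}
toy["condition_code"] = toy["condition"].map(condition_order)

# Nominal: one indicator column per category (first category dropped).
encoded = pd.get_dummies(toy, columns=["zone"], drop_first=True)
print(encoded)
```

`LabelEncoder` assigns integers alphabetically, which would place "fair" before "good" before "poor" in this toy case; an explicit mapping avoids that.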

### Feature Selection
To identify relevant features, we will use:

- Random Forest: An ensemble learning method that can estimate feature importance.
- Select K Best: A univariate selection method that scores each feature separately against the target (here with the F-statistic from `f_regression`) and keeps the k highest-scoring features.

#####  Random Forest for Feature Importance

In [None]:
# Assuming 'landvalue' is your target variable and rest are features
X = data.drop(columns=['landvalue'])  # Features (all except the target)
y = data['landvalue']  # Target variable

# Train a Random Forest Regressor
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X, y)

# Get feature importance
feature_importance = pd.Series(rf_model.feature_importances_, index=X.columns)
feature_importance = feature_importance.sort_values(ascending=False)

print("Feature Importance using Random Forest:\n", feature_importance)

##### Select K Best for Feature Selection

In [None]:
# Select top K features using SelectKBest
k = min(10, X.shape[1])  # 'k' cannot exceed the number of available features; adjust as needed
k_best = SelectKBest(score_func=f_regression, k=k)
fit = k_best.fit(X, y)

# Get scores for each feature
feature_scores = pd.DataFrame({'Feature': X.columns, 'Score': fit.scores_})
feature_scores = feature_scores.sort_values(by='Score', ascending=False)

print("Feature Scores using Select K Best:\n", feature_scores)
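
Besides the raw scores, a fitted `SelectKBest` exposes a boolean mask of the kept features through `get_support()`. The sketch below uses synthetic data (an assumption, not the real dataset) in which only columns `a` and `c` drive the target, to show how the selected column names can be recovered.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic data: the target depends on 'a' and 'c' only; 'b' and 'd' are noise.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(200, 4)), columns=["a", "b", "c", "d"])
y_demo = 3.0 * X_demo["a"] - 2.0 * X_demo["c"] + rng.normal(scale=0.1, size=200)

selector = SelectKBest(score_func=f_regression, k=2).fit(X_demo, y_demo)

# get_support() returns a boolean mask aligned with the input columns.
kept = X_demo.columns[selector.get_support()].tolist()
print("Selected features:", kept)
```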

#### Remove Redundant or Irrelevant Features

In [None]:
# Select top features
selected_features = feature_importance.index[:10]  # For example, taking top 10 features
X_selected = X[selected_features]

print("Selected Features:\n", X_selected.head())


####  Split Data into Training and Testing Sets

In [None]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_selected, y, test_size=0.3, random_state=42)

print(f"Training set: {X_train.shape}, Testing set: {X_test.shape}")


#### Feature Scaling
To give numerical features a comparable magnitude, we can apply Min-Max Scaling or Standardization. This is particularly useful for models that rely on distance metrics (e.g., SVM, KNN).

- Standardization (z-score normalization) rescales each feature so that it has a mean of 0 and a standard deviation of 1:

In [None]:
# Initialize the StandardScaler
scaler = StandardScaler()

# Fit the scaler on the training data and transform both train and test data
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("Feature scaling (StandardScaler) complete.")
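
As a worked check of the formula z = (x - mean) / std, the sketch below compares a manual computation with `StandardScaler` on toy numbers. The mean and standard deviation come from the training data only, and scikit-learn uses the population standard deviation (`ddof=0`).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy single-feature data (illustrative values).
train = np.array([[1.0], [2.0], [3.0], [4.0]])
test = np.array([[5.0]])

scaler_demo = StandardScaler().fit(train)

# Manual z-score using training statistics; np.std defaults to ddof=0.
mu = train.mean(axis=0)
sigma = train.std(axis=0)
manual = (test - mu) / sigma

print("StandardScaler:", scaler_demo.transform(test))
print("Manual z-score:", manual)
```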

#### Min-Max Scaling
- Min-Max Scaling transforms the data by scaling features to a range (usually between 0 and 1).

In [None]:
# Initialize the MinMaxScaler
min_max_scaler = MinMaxScaler()

# Fit on the training data, then transform both sets
# (stored under separate names so the StandardScaler output is not overwritten)
X_train_minmax = min_max_scaler.fit_transform(X_train)
X_test_minmax = min_max_scaler.transform(X_test)

print("Feature scaling (MinMaxScaler) complete.")
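
Min-Max Scaling computes (x - min) / (max - min) with the minimum and maximum taken from the training data. One consequence, sketched below on toy numbers, is that test values outside the training range land outside [0, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy single-feature data; the test value exceeds the training maximum.
train = np.array([[10.0], [20.0], [30.0]])
test = np.array([[40.0]])

mm_demo = MinMaxScaler().fit(train)

# Manual computation with training min and max.
t_min = train.min(axis=0)
t_max = train.max(axis=0)
manual = (test - t_min) / (t_max - t_min)

print("MinMaxScaler:", mm_demo.transform(test))
print("Manual:      ", manual)
```

This is expected behavior rather than an error, but it is worth knowing when a downstream model assumes inputs in a fixed range.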

#### Conclusion:
- Feature Selection: Used Random Forest and SelectKBest to select the most relevant features.
- Data Splitting: Split the dataset into training and testing subsets using train_test_split.
- Feature Scaling: Applied either StandardScaler or MinMaxScaler to scale the numerical features.

### Build the ML Model

#### Regression Models

In [None]:
# Split data for regression
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(X_selected, y, test_size=0.3, random_state=42)

# Initialize regressors
regressors = {
    "Linear Regression": LinearRegression(),
    "SVR": SVR(),
    "Random Forest Regressor": RandomForestRegressor(),
    "MLP Regressor": MLPRegressor(),
    "Gradient Boosting Regressor": GradientBoostingRegressor()
}

for name, reg in regressors.items():
    reg.fit(X_train_reg, y_train_reg)
    y_pred_reg = reg.predict(X_test_reg)
    print(f"Results for {name}:")
    print("MAE:", mean_absolute_error(y_test_reg, y_pred_reg))
    print("MSE:", mean_squared_error(y_test_reg, y_pred_reg))
    print("RMSE:", mean_squared_error(y_test_reg, y_pred_reg) ** 0.5)  # square root of MSE; works across scikit-learn versions
    print("R2 Score:", r2_score(y_test_reg, y_pred_reg))
    print("="*50)
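
The regressors above are fitted on the unscaled `X_train_reg`. For scale-sensitive models such as SVR, one option is to put the scaler and the model in a single pipeline, so that scaling is fitted on the training split only. The sketch below uses synthetic data with features on very different scales; all names and values in it are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic features whose magnitudes differ by several orders.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(300, 3)) * np.array([1.0, 100.0, 10000.0])
y_demo = X_demo[:, 0] + X_demo[:, 1] / 100.0 + rng.normal(scale=0.1, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

# The scaler is fitted inside the pipeline, on the training split only.
svr_pipeline = make_pipeline(StandardScaler(), SVR())
svr_pipeline.fit(X_tr, y_tr)

pred = svr_pipeline.predict(X_te)
r2_scaled = r2_score(y_te, pred)
print("R2 with scaling inside the pipeline:", r2_scaled)
```

The same pattern would apply to `MLPRegressor`; tree-based models are largely insensitive to feature scale and do not need it.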


In [None]:
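# Example for Logistic Regression
# ROC curves apply to binary classifiers, not to the regressors above.
# Illustration only: 'landvalue' is binarized at its training median (an assumed
# threshold) so that a classifier can be fitted and scored.
from sklearn.linear_model import LogisticRegression

y_train_bin = (y_train > y_train.median()).astype(int)
y_test_bin = (y_test > y_train.median()).astype(int)

log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train_scaled, y_train_bin)

y_pred_prob = log_reg.predict_proba(X_test_scaled)[:, 1]
fpr, tpr, _ = roc_curve(y_test_bin, y_pred_prob)
roc_auc = auc(fpr, tpr)

plt.plot(fpr, tpr, label=f"ROC curve (area = {roc_auc:.2f})")
plt.plot([0, 1], [0, 1], linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.show()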


### Hyperparameter Tuning

In [None]:
# Example: Hyperparameter tuning for RandomForest
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [10, 20, 30],
    'min_samples_split': [2, 5, 10]
}
grid_search = GridSearchCV(RandomForestRegressor(), param_grid, cv=5, scoring='r2', n_jobs=-1)
grid_search.fit(X_train_reg, y_train_reg)

print(f"Best Hyperparameters: {grid_search.best_params_}")
print(f"Best R2 Score: {grid_search.best_score_}")

# Save the best model found by the grid search
joblib.dump(grid_search.best_estimator_, 'random_forest_model.pkl')

# Load the saved model
loaded_model = joblib.load('random_forest_model.pkl')
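
The tune, save, and reload sequence can be sketched end to end as follows, on synthetic data and with a deliberately small grid so it runs quickly. The data, the grid values, and the file name `demo_model.pkl` are assumptions for illustration. The check at the end confirms that the reloaded model predicts the same values as the tuned one.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 3))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=120)

# Small grid; refit=True (the default) retrains the best setting on all data.
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    {"n_estimators": [20, 40], "max_depth": [3, None]},
    cv=3,
    scoring="r2",
)
search.fit(X_demo, y_demo)

# Persist the refitted best estimator and load it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_model.pkl")
    joblib.dump(search.best_estimator_, path)
    restored = joblib.load(path)

same = np.allclose(search.best_estimator_.predict(X_demo), restored.predict(X_demo))
print(search.best_params_, "round trip OK:", same)
```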

### Test with Unseen Data

In [None]:
# Load unseen data (if available)
unseen_data = pd.read_csv(' ')
X_unseen = unseen_data[selected_features]

# Predict on unseen data
unseen_predictions = loaded_model.predict(X_unseen)

print("Predictions on unseen data:", unseen_predictions)
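
Before predicting, it helps to confirm that the unseen data contains every feature the model was trained on, in the same order. The helper below is a hypothetical sketch (`align_features` is not part of pandas or scikit-learn), and it assumes the unseen data has already been encoded the same way as the training data.

```python
import pandas as pd

def align_features(frame: pd.DataFrame, feature_names) -> pd.DataFrame:
    """Return the frame restricted to feature_names, in that order.

    Raises KeyError if any required column is absent.
    """
    missing = [c for c in feature_names if c not in frame.columns]
    if missing:
        raise KeyError(f"Unseen data is missing columns: {missing}")
    return frame[list(feature_names)]

# Hypothetical feature list and unseen frame with extra, reordered columns.
features = ["area", "rooms"]
unseen_demo = pd.DataFrame({"rooms": [3, 2], "extra": [0, 1], "area": [120.0, 80.0]})

X_aligned = align_features(unseen_demo, features)
print(X_aligned)
```

In the notebook this would be called as `align_features(unseen_data, selected_features)` in place of direct column indexing.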


###  Interpretation of Results (Conclusion)
After going through the process of building, training, evaluating, and tuning multiple machine learning models, here are some key insights and interpretations from the analysis:

- Best Models: Tree-based ensembles (Random Forest Regressor and Gradient Boosting Regressor) are expected to be the strongest candidates; confirm this by comparing the MAE, RMSE, and R2 values printed for each regressor.
- Feature Selection: The top 10 features by Random Forest importance were kept for modeling, with the SelectKBest scores serving as a cross-check on that ranking.
- Tuning: Hyperparameter tuning with GridSearchCV (5-fold cross-validation, scored by R2) was applied to the Random Forest Regressor to optimize its performance.
- Scaling: Standardization or Min-Max Scaling matters most for scale-sensitive models such as SVR and the MLP Regressor. The regressors above were fitted on unscaled features, so the benefit only appears when the scaled arrays are passed to those models.

### Future Work
- Explore more complex models like XGBoost or LightGBM and fine-tune them further.
- Improve the dataset through additional feature engineering or by gathering more data to make the models more robust.