# Problem statement
Over the years, Paris has faced a lot of housing-related issues, including:

    . Affordability
    . Supply and demand imbalance
    . Rising rent prices
    . Urbanization
    . Regulatory challenges
    . Social Housing Shortages
    . Market Speculation
    . Impact of External Events

In the face of burgeoning urban populations and evolving socio-economic dynamics, a distinguished group of investors seeks to establish a real estate 
firm in the vibrant city of Paris. They wish to address affordable housing, which has emerged as a critical challenge.
Because a growing number of middle-class buyers are also seeking high-end properties in Paris, Mali Safi and Sons has been engaged to
analyze existing data, draw meaningful insights, and guide the investors' strategic decisions on affordable housing.

Essentially the real estate company wants:

i. To identify the variables affecting the prices 

ii. To construct a regression model that relates the property prices to the variables
  
iii. To evaluate the performance of the regression model in identifying the factors affecting property prices.

# Data Understanding

In [None]:
#importing relevant modules
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats
import statsmodels.api as sm
import statsmodels.formula.api as smf
from sklearn.linear_model import LinearRegression
import warnings
warnings.filterwarnings("ignore")

In [None]:
df = pd.read_csv('ParisHousing.csv')
df.head()

In [None]:
#Checking the shape of the data
df.shape

In [None]:
#Understanding the general information of the data
df.info()

In [None]:
#Understanding the descriptive statistics of the data
df.describe()

# Data preparation and cleaning

## Defining variables


Dependent variable:
   
    1. Price

Independent variables: 
    
    1. Square meters
    2. No of rooms
    3. Has pool
    4. Has yard
    5. Year made
    6. Is new built
    7. Has storm protector
    8. Basement
    9. Attic
    10. Garage
    11. Storage room
    12. Guest room
    13. Floors
    14. Number of previous owners

Categorical Variables:

    1. has_pool
    2. has_storm_protector
    3. has_basement
    4. has_attic
    5. has_garage
    6. has_storage_room
    7. has_guest_room
    8. made: year the property was made
    9. isNewBuilt
    10. floors

Continuous Variables:

    1. Price
    2. square meters: Size of the Property
    3. Number of Rooms

In [None]:
#Creating a new dataframe for analysis
new_df = df.loc[:, ['squareMeters', 'numberOfRooms', 'hasYard', 'hasPool', 'floors', 'numPrevOwners', 'made', 'isNewBuilt', 'hasStormProtector',
                      'basement', 'attic', 'garage', 'hasStorageRoom', 'hasGuestRoom', 'price']]
new_df

In [None]:
#Checking for null values
new_df.isnull().sum()

In [None]:
#Checking for duplicates
new_df.duplicated().sum()

## Checking for outliers

In [None]:
#checking for outliers for price
plt.figure(figsize=(10,2))

sns.boxplot(x = 'price', data = new_df)

# Display the plot
plt.show()

In [None]:
#checking for outliers for squareMeters
plt.figure(figsize=(10,2))

sns.boxplot(x = 'squareMeters', data = new_df)

# Display the plot
plt.show()

In [None]:
#checking for outliers for numberOfRooms
plt.figure(figsize=(10,2))

sns.boxplot(x = 'numberOfRooms', data = new_df)

# Display the plot
plt.show()

In [None]:
#checking for outliers for hasYard

plt.figure(figsize=(10,2))

sns.boxplot(x = 'hasYard', data = new_df)
# Display the plot
plt.show()

In [None]:
#checking for outliers for hasPool
plt.figure(figsize=(10,2))

sns.boxplot(x = 'hasPool', data = new_df)
# Display the plot
plt.show()

In [None]:
#checking for outliers for isNewBuilt
plt.figure(figsize=(10,2))

sns.boxplot(x = 'isNewBuilt', data = new_df)
# Display the plot
plt.show()

In [None]:
#checking for outliers for hasStormProtector
plt.figure(figsize=(10,2))

sns.boxplot(x = 'hasStormProtector', data = new_df)
# Display the plot
plt.show()

In [None]:
#checking for outliers for hasStorageRoom
plt.figure(figsize=(10,2))

sns.boxplot(x = 'hasStorageRoom', data = new_df)
# Display the plot
plt.show()

In [None]:
#checking for outliers for numPrevOwners

plt.figure(figsize=(10,2))

sns.boxplot(x = 'numPrevOwners', data = new_df)
# Display the plot
plt.show()

In [None]:
#checking for outliers for made
plt.figure(figsize=(10,2))

sns.boxplot(x = 'made', data = new_df)
# Display the plot
plt.show()

In [None]:
#checking for outliers for basement
plt.figure(figsize=(10,2))

sns.boxplot(x = 'basement', data = new_df)
# Display the plot
plt.show()

In [None]:
#checking for outliers for garage
plt.figure(figsize=(10,2))

sns.boxplot(x = 'garage', data = new_df)
# Display the plot
plt.show()

In [None]:
#checking for outliers for attic
plt.figure(figsize=(10,2))

sns.boxplot(x = 'attic', data = new_df)
# Display the plot
plt.show()

In [None]:
#checking for outliers for hasGuestRoom
plt.figure(figsize=(10,2))

sns.boxplot(x = 'hasGuestRoom', data = new_df)
# Display the plot
plt.show()
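
The boxplots above flag outliers visually. A minimal numeric sketch of the same whisker rule (1.5 times the interquartile range) is shown here; the helper name `iqr_outlier_mask` and the toy prices are assumptions for illustration only.

```python
import numpy as np

def iqr_outlier_mask(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR], the usual boxplot whisker rule."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Hypothetical prices: the last value sits far away from the rest
toy_prices = np.array([100.0, 110.0, 120.0, 130.0, 140.0, 1000.0])
mask = iqr_outlier_mask(toy_prices)
print('outliers:', toy_prices[mask])
```

For 0/1 indicator columns such as `hasPool` or `isNewBuilt`, a count of each value is usually more informative than a boxplot, because the quartiles can only be 0 or 1.
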

## Normalization and Scaling

In [None]:
from sklearn.preprocessing import MinMaxScaler

# Initialize the MinMaxScaler
scaler = MinMaxScaler()
# Fitting and transforming the (new_df) data
scaled_new_df = scaler.fit_transform(new_df)
# Print the original and scaled data
print("Original Data:")
print(new_df)
print("\nScaled Data:")
print(scaled_new_df)


In [None]:
scaled_new_df

In [None]:
#assigning the columns to the scaled data for easier visualisation

new_df_columns = ['squareMeters', 'numberOfRooms', 'hasYard', 'hasPool', 'floors', 'numPrevOwners', 'made', 'isNewBuilt', 'hasStormProtector',
                      'basement', 'attic', 'garage', 'hasStorageRoom', 'hasGuestRoom', 'price']

scaled_df = pd.DataFrame(scaled_new_df, columns=new_df_columns)

scaled_df
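
For reference, min-max scaling maps each column to the range 0 to 1 with (x - min) / (max - min). The sketch below reproduces that formula by hand on made-up values; the function name `min_max_scale` and the constant-column handling are assumptions for illustration.

```python
import numpy as np

def min_max_scale(column):
    """Rescale a 1-D array to [0, 1] with (x - min) / (max - min)."""
    column = np.asarray(column, dtype=float)
    span = column.max() - column.min()
    if span == 0:
        # A constant column has no spread; return zeros rather than divide by zero
        return np.zeros_like(column)
    return (column - column.min()) / span

# Hypothetical property sizes
toy_square_meters = np.array([50.0, 100.0, 150.0, 250.0])
scaled = min_max_scale(toy_square_meters)
print(scaled)
```

The smallest value becomes 0, the largest becomes 1, and the spacing between values is preserved, so the shape of each distribution is unchanged by this step.
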

# Exploratory Data analysis

In [None]:
new_df.describe()

In [None]:
scaled_df.describe()

In [None]:
#distribution of the scaled data before transformation

fig, axes = plt.subplots(5, 3, figsize=(18, 12))  # 15 columns fill a 5 x 3 grid with no empty panels

axes = axes.flatten()

for i, column in enumerate(scaled_df.columns):
    ax = axes[i]
    sns.histplot(scaled_df[column], kde=True, ax=ax)
    ax.set_title(f'Histogram of {column}')
    ax.set_xlabel(column)
    ax.set_ylabel('Frequency')

plt.tight_layout()
plt.show()

In [None]:
#Transforming scaled_df from a roughly uniform distribution towards a normal distribution via QuantileTransformer
from sklearn.preprocessing import QuantileTransformer
# Create an instance of QuantileTransformer
quantile_transformer = QuantileTransformer(output_distribution='normal')

transformed_df = quantile_transformer.fit_transform(scaled_df)
transformed_df = pd.DataFrame(transformed_df, columns=scaled_df.columns)
transformed_df
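
To illustrate the idea behind a quantile transformation to a normal distribution, the sketch below ranks the values, turns the ranks into probabilities, and passes them through the inverse normal CDF. It is a simplified stand-in, not scikit-learn's exact algorithm; the helper name `rank_to_normal` and the toy values are assumptions.

```python
import numpy as np
from statistics import NormalDist

def rank_to_normal(values):
    """Map values to standard-normal scores through their ranks (ties broken by order)."""
    values = np.asarray(values, dtype=float)
    n = len(values)
    ranks = np.argsort(np.argsort(values))   # rank 0 .. n-1 of each value
    uniform = (ranks + 0.5) / n              # probabilities strictly inside (0, 1)
    std_normal = NormalDist()
    return np.array([std_normal.inv_cdf(u) for u in uniform])

# Hypothetical skewed values
toy = np.array([10.0, 200.0, 30.0, 4000.0, 50.0])
z = rank_to_normal(toy)
print(np.round(z, 3))
```

Only the ordering of the values survives this kind of transform; the original distances between them do not. A 0/1 column has just two distinct values, so no monotonic transform can give it a bell shape.
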

In [None]:
#Normal distributed data
fig, axes = plt.subplots(5, 3, figsize=(18, 12))  # 15 columns fill a 5 x 3 grid with no empty panels

axes = axes.flatten()

for i, column in enumerate(transformed_df.columns):
    ax = axes[i]
    sns.histplot(transformed_df[column], kde=True, ax=ax)
    ax.set_title(f'Histogram of {column}')
    ax.set_xlabel(column)
    ax.set_ylabel('Frequency')

plt.tight_layout()
plt.show()

In [None]:
correlation_matrix = transformed_df.corr()

plt.figure(figsize=(10, 8))
heatmap = sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f", linewidths=0.5)

heatmap.set_title('Heatmap of Variable Relationships', fontsize=16)
heatmap.set_xticklabels(heatmap.get_xticklabels(), rotation=45, horizontalalignment='right')

plt.show()

In [None]:
correlation_with_price =  transformed_df.corr()['price'].sort_values(ascending=False)

correlation_with_price.plot(kind='bar', figsize=(12, 6), color='skyblue')
plt.title('Correlation with Price')
plt.xlabel('Features')
plt.ylabel('Correlation')
plt.show()
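
Each cell of the heatmap and each bar above is a Pearson correlation coefficient. A small hand-rolled version is sketched here on made-up numbers to show what is being computed; the function name `pearson_r` and the toy arrays are assumptions.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_centered = x - x.mean()
    y_centered = y - y.mean()
    return float((x_centered * y_centered).sum()
                 / np.sqrt((x_centered ** 2).sum() * (y_centered ** 2).sum()))

# Hypothetical sizes and prices that rise together
toy_size = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
toy_price = np.array([2.1, 3.9, 6.2, 8.0, 9.8])
r = pearson_r(toy_size, toy_price)
print(round(r, 4))
```

A value near 1 means the two variables rise together almost linearly, a value near 0 means no linear association, and a value near -1 means one falls as the other rises.
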

## Training and testing sets

Splitting the dataset into training and testing sets

In [None]:
from sklearn.model_selection import train_test_split
# Specify the features (X) and the target variable (y)
X = transformed_df.drop('price', axis=1)
y = transformed_df['price']

# Split the dataset into training and testing sets
# The test_size parameter specifies the proportion of the dataset to include in the test split (here, 20%)
# The random_state parameter ensures reproducibility by fixing the random seed
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Now you can use X_train and y_train for training your model
# and X_test and y_test for evaluating its performance
X_test
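
To make the roles of `test_size` and `random_state` concrete, the sketch below splits row positions by hand. It is an illustration of the idea only, not scikit-learn's implementation; the helper name `split_indices` is an assumption.

```python
import numpy as np

def split_indices(n_rows, test_size=0.2, seed=42):
    """Shuffle row positions with a fixed seed; the first share becomes the test set."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_rows)
    n_test = int(round(n_rows * test_size))
    return shuffled[n_test:], shuffled[:n_test]

train_idx, test_idx = split_indices(10)
print(len(train_idx), 'training rows,', len(test_idx), 'test rows')
```

Every row lands in exactly one of the two sets, and calling the function again with the same seed returns the same split, which is what makes the results reproducible.
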

Training the model using the training set

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
# Create a linear regression model
model = LinearRegression()

# Train the model using the training set
model.fit(X_train, y_train)

# Predict on the held-out test set and score the predictions
y_pred = model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f'Mean Squared Error: {mse}')
print(f'R-squared Score: {r2}')
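
The two metrics printed above can be written out by hand. The sketch below shows the arithmetic on made-up numbers; the function name `mse_and_r2` and the toy arrays are assumptions for illustration.

```python
import numpy as np

def mse_and_r2(actual, predicted):
    """Mean squared error and R-squared (1 - residual SS / total SS)."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    residual_ss = ((actual - predicted) ** 2).sum()
    total_ss = ((actual - actual.mean()) ** 2).sum()
    return residual_ss / len(actual), 1.0 - residual_ss / total_ss

# Hypothetical values: three exact predictions and one miss
toy_actual = np.array([1.0, 2.0, 3.0, 4.0])
toy_predicted = np.array([1.0, 2.0, 3.0, 5.0])
toy_mse, toy_r2 = mse_and_r2(toy_actual, toy_predicted)
print(toy_mse, toy_r2)
```

An R-squared of 1 requires the residual sum of squares to be zero, that is, every prediction must match its actual value.
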

OLS regression summary

In [None]:
results_1 = smf.ols('price ~ squareMeters', data = transformed_df).fit()
results_1.summary()

## Using two variables for train and test sets

In [None]:
from sklearn.metrics import mean_squared_error

# Step 1 : split the data into train and test sets

train_data,test_data=train_test_split(transformed_df,train_size=0.8,random_state=3)

# Step 2 : Train the model on the Training set
reg=LinearRegression()

x_train=np.array(train_data['squareMeters']).reshape(-1,1)

y_train=np.array(train_data['price']).reshape(-1,1)

reg.fit(x_train,y_train)

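# Step 3 : Evaluate the model on the train and test sets
# Note: x_test and y_test come from this split (random_state=3), so y_test
# no longer lines up with X_test and y_pred from the earlier split (random_state=42)

x_test=np.array(test_data['squareMeters']).reshape(-1,1)

y_test=np.array(test_data['price']).reshape(-1,1)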
print('R squared training',round(reg.score(x_train,y_train),3))

print('R squared testing',round(reg.score(x_test,y_test),3) )

print('intercept',reg.intercept_)

print('coefficient',reg.coef_)

#### Intercept and Coefficient interpretation:
The intercept is 2.27299494e-05, and the coefficient for the independent variable (squareMeters) is 1.00034209.
Both variables were quantile-transformed to a standard normal scale, so the coefficient is expressed in transformed units rather than in price per square meter.
A coefficient very close to 1, together with an intercept very close to 0, indicates a strong positive linear relationship between squareMeters and price: 
a one-unit increase in transformed squareMeters is associated with an increase of about one unit in transformed price.
How well the line fits is measured by the R squared values printed above, not by the coefficient itself.
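
As a rough illustration of where an intercept and a coefficient come from, the sketch below fits a line with the closed-form least-squares formulas, first on raw toy values and then on standardized ones. The data and the helper names `fit_line` and `standardize` are assumptions for illustration.

```python
import numpy as np

def fit_line(x, y):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
    intercept = y.mean() - slope * x.mean()
    return slope, intercept

def standardize(v):
    """Shift to mean 0 and rescale to standard deviation 1."""
    v = np.asarray(v, dtype=float)
    return (v - v.mean()) / v.std()

# Hypothetical sizes and prices
toy_size = np.array([40.0, 60.0, 80.0, 100.0, 120.0])
toy_price = np.array([410.0, 590.0, 820.0, 980.0, 1210.0])

raw_slope, raw_intercept = fit_line(toy_size, toy_price)
std_slope, std_intercept = fit_line(standardize(toy_size), standardize(toy_price))
r = np.corrcoef(toy_size, toy_price)[0, 1]

print('raw slope and intercept:', raw_slope, raw_intercept)
print('standardized slope:', std_slope, 'correlation:', r)
```

When both variables have mean 0 and standard deviation 1, the slope equals the correlation coefficient and the intercept is 0. That is one way to read a coefficient near 1 on transformed data: it reflects a correlation near 1, not a price change per square meter.
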


In [None]:
from statsmodels.stats.anova import anova_lm

# Hypothesis testing for the influence of multiple variables on price
m_1 = 'price ~ basement + numberOfRooms + squareMeters + attic + garage'
result3 = smf.ols(m_1, data=transformed_df).fit()

# Sequential ANOVA table: one F-test per term, in the order given in the formula
anova_table = anova_lm(result3)

alpha = 0.05  # significance level
terms = ['basement', 'numberOfRooms', 'squareMeters', 'attic', 'garage']

# The null hypothesis is rejected here only if every term is individually significant
if (anova_table.loc[terms, 'PR(>F)'] < alpha).all():
    print("Reject the null hypothesis: each variable in the model significantly influences the price.")
else:
    print("Do not reject the null hypothesis for every variable: at least one variable does not show a significant influence on the price.")

In [None]:
from scipy.stats import linregress

# Hypothesis Testing for Square Meters and Price Relationship
slope, intercept, r_value, p_value, std_err = linregress(transformed_df['squareMeters'], transformed_df['price'])

alpha = 0.05  # significance level

if p_value < alpha:
    print("Reject the null hypothesis: There is a significant linear relationship between square meters and price.")
else:
    print("Fail to reject the null hypothesis: There is no significant linear relationship between square meters and price.")
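
The p-value used above comes from a t-test on the slope. A minimal sketch of that calculation is given here, using the identity t = r * sqrt((n - 2) / (1 - r^2)) for simple linear regression; the helper name `slope_p_value` and the toy data are assumptions for illustration.

```python
import numpy as np
from scipy import stats

def slope_p_value(x, y):
    """Two-sided p-value for H0: slope = 0 in a simple linear regression."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    r = np.corrcoef(x, y)[0, 1]
    t_stat = r * np.sqrt((n - 2) / (1.0 - r ** 2))
    # Survival function gives the upper tail; double it for a two-sided test
    return t_stat, 2.0 * stats.t.sf(abs(t_stat), df=n - 2)

# Hypothetical noisy but increasing data
toy_x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
toy_y = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0])

t_stat, p = slope_p_value(toy_x, toy_y)
print('t statistic:', round(t_stat, 3), 'p-value:', round(p, 4))
```

The formula breaks down when r is exactly 1 or -1, since the denominator becomes zero; in that case the relationship is perfectly linear and the p-value is effectively 0.
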

## Plotting the Predicted Regression line for two variables.

In [None]:
_, ax = plt.subplots(figsize= (10, 8))

plt.scatter(x_test, y_test, color= 'blue', label = 'data')

plt.plot(x_test, reg.predict(x_test), color='red', label= 'Predicted Regression line')

plt.xlabel('squareMeters')
plt.ylabel('price')
plt.legend()
plt.gca().spines['right'].set_visible(False)
plt.gca().spines['top'].set_visible(False)

Summary of regression results between multiple variables.

In [None]:
m_1 = 'price ~ basement + numberOfRooms + squareMeters + attic + garage'

result3 = smf.ols(m_1, data = transformed_df).fit()
print(result3.summary())

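Predicted versus actual prices for the multiple-variable model

In [None]:
_, ax = plt.subplots(figsize= (10, 8))

# A model with several predictors cannot be drawn as one line against a single
# feature, so compare the fitted values of result3 with the actual prices
actual = transformed_df['price']
ax.scatter(actual, result3.fittedvalues, color= 'blue', alpha=0.5, label = 'data')
ax.plot([actual.min(), actual.max()], [actual.min(), actual.max()], color='red', label= 'Perfect prediction line')

ax.set_xlabel('actual price')
ax.set_ylabel('predicted price (basement, numberOfRooms, squareMeters, attic, garage)')
ax.legend()
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
plt.show()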

## Visualizing the model's predictions and comparing them to the actual values

In [None]:
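# Scatter plot of actual vs. predicted values
# y_test was reassigned in the single-feature cell, so take the actual prices
# that match y_pred from the rows of X_test
plt.scatter(y.loc[X_test.index], y_pred, alpha=0.5)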
plt.title('Actual vs. Predicted Values')
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.show()

In [None]:
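# Line plot of actual vs. predicted values
# Use the actual prices for the rows of X_test, as plain arrays, so both
# series share the same x-axis positions
plt.plot(y.loc[X_test.index].to_numpy(), label='Actual', marker='o')
plt.plot(y_pred, label='Predicted', marker='x')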
plt.title('Actual vs. Predicted Values Over Data Points')
plt.xlabel('Data Points')
plt.ylabel('Values')
plt.legend()
plt.show()

# Summary and Conclusion

The comprehensive analysis conducted by Mali Safi and Sons serves as a pivotal guide for the Distinguished Group of Parisian Investors venturing into the real estate sector. The significance of the models lies in their ability to offer nuanced insights, predictive accuracy, and strategic guidance tailored to the unique dynamics of the Paris housing market.

The hypotheses tested relate directly to the models created, and together they give the Distinguished Group of Investors clear insight to support their decision-making.

The models, developed through meticulous data analysis, not only identify key variables influencing property prices (squareMeters, numberOfRooms, basement, attic, and garage) but also provide a robust framework for predictive modeling. These tools enable Mali Safi and Sons to offer valuable guidance to the investors, informing their decisions on strategic investments, risk mitigation, and the development of tailored solutions to address the pressing issue of affordable housing.

By examining the relationships between the features and property prices (for example, a property of 83841 square meters is priced at 8390030.5), the models empower the investors to navigate the intricacies of the market with a data-driven approach. Statistical significance established through hypothesis testing adds credibility to the insights derived, so the investors can base their decisions on a solid foundation of analysis. In both models, the R-squared and the adjusted R-squared are equal to 1, which indicates an essentially perfect fit, and both models show a significant linear relationship between the variables and property prices.

Furthermore, the models facilitate effective communication of complex insights, allowing Mali Safi and Sons to convey the significance of variables and market trends to the investors in a clear and actionable manner. In essence, these analytical tools not only serve as predictive instruments but also as strategic enablers, guiding the Distinguished Group of Parisian Investors towards well-informed and strategic decisions in the realm of real estate.