# Abstract

This study uses linear regression to predict house prices in Jakarta and Tebet, Indonesia. After analyzing the data, we obtained a prediction accuracy of 65.64%. This result indicates that the model produces reasonably accurate predictions, although there is still room for improvement. The study offers insight into the factors that influence house prices in the area and demonstrates the usefulness of linear regression as a tool for analyzing property market trends. Future work could extend this study by refining the model and exploring additional factors that may affect house prices. Overall, the experiment highlights the potential of linear regression for property price forecasting in Jakarta and Tebet.

# Introduction

In recent years, the property industry in Jakarta and Tebet, Indonesia has grown rapidly, making it an attractive area for research and experimentation. A common task in this field is forecasting house prices, which is challenging because the market is complex and dynamic. In this experiment, we use linear regression, a widely used statistical method, to analyze the data and predict house prices in Jakarta and Tebet. By applying this technique, we aim to gain insight into the factors that influence house prices and to improve our ability to predict them accurately.

# Literature Review

**Linear regression** is a statistical technique for modeling the relationship between a dependent variable and one or more independent variables. Its main advantages and disadvantages are as follows:

**Advantages:**

* Easy to understand and implement.
* Useful for making predictions and forecasting future outcomes.
* Provides a measure of the strength and direction of the relationship between the dependent variable and the independent variables.
* Helps identify the independent variables that most influence the dependent variable.
* Allows the effects of several independent variables on the dependent variable to be analyzed simultaneously.

**Disadvantages:**

* Assumes a linear relationship between the dependent and independent variables, which does not always hold in practice.
* Cannot by itself establish cause-and-effect relationships between variables.
* Sensitive to outliers, which can significantly affect the results.
* Assumes that observations are independent, which may not hold in some cases.
* Cannot handle non-numeric data or categorical variables without some transformation.
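
As a rough illustration of the points above, the following sketch fits a line by ordinary least squares with NumPy and then refits after adding one extreme point, to show the sensitivity to outliers. The data are synthetic and every variable name is an assumption made for this example; none of it comes from the Tebet dataset.

```python
import numpy as np

# Synthetic data (hypothetical, not the Tebet dataset): y = 3 + 2*x + noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 3.0 + 2.0 * x + rng.normal(0, 1, size=50)

# Design matrix with a column of ones for the intercept
X = np.column_stack([np.ones_like(x), x])

# Ordinary least squares: minimise ||y - X b||^2
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Add a single extreme outlier and refit
x_out = np.append(x, 10.0)
y_out = np.append(y, 200.0)
X_out = np.column_stack([np.ones_like(x_out), x_out])
beta_out, *_ = np.linalg.lstsq(X_out, y_out, rcond=None)

print("clean fit    (intercept, slope):", beta)
print("with outlier (intercept, slope):", beta_out)
```

The slope of the clean fit should land close to the true value of 2, while the single outlier pulls the second fit upward.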


## Load data

In [None]:
import pandas as pd

df = pd.read_excel('DATA RUMAH TEBET.xlsx')
df.info()

- Explanation of each attribute:
    1. Nama Rumah = House name
    2. LB = Total Building Area
    3. LT = Total Land Area
    4. KT = Number of Bedrooms
    5. KM = Number of Bathrooms
    6. GRS = Garage capacity (number of cars)
    7. Harga (column `HARGA`) = House price (IDR)

In [None]:
df.shape

In [None]:
df.head()

In [None]:
df = df.loc[:,['LB', 'LT', 'KT', 'KM', 'GRS', 'HARGA']]
df.head()

# **Exploratory Data Analysis & Preprocessing**

### Data information
Statistical description of each attribute: mean, quartiles, standard deviation, etc.

In [None]:
df.describe()

### Identify outliers

#### Identification of outlier attribute LB

In [None]:
import seaborn as sns

sns.boxplot(x='LB', data=df);

Compute Q1, Q3, the interquartile range (IQR = Q3 - Q1), and the upper and lower limits (Q3 + 1.5 × IQR and Q1 - 1.5 × IQR); LB values outside these limits are outliers.

In [None]:
import numpy as np

q11,q31=np.percentile(df['LB'], [25,75])
s1 = q31-q11
ba1 = q31+(1.5*s1)
bw1 = q11-(1.5*s1)
print(q11)
print(q31)
print(s1)
print(ba1)
print(bw1)

In [None]:
# shows outlier data on the LB attribute
dt1 = df[(df['LB']<bw1) | (df['LB']>ba1)]
dt1.head()
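
The same fence computation is repeated for every attribute, so it can be wrapped in a small helper. The sketch below is one possible way to do that; the function name `iqr_fences`, the parameter `k`, and the `demo` values are all hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd

def iqr_fences(values, k=1.5):
    """Return (lower, upper) limits: Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical building areas, not taken from the dataset
demo = pd.DataFrame({"LB": [90, 120, 150, 180, 200, 220, 250, 300, 1200]})

lower, upper = iqr_fences(demo["LB"])
outliers = demo[(demo["LB"] < lower) | (demo["LB"] > upper)]
print(lower, upper)   # 0.0 400.0
print(outliers)       # only the row with LB = 1200
```

With the notebook's data, a call such as `iqr_fences(df['LB'])` would be expected to reproduce the `bw1` and `ba1` values computed above.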


#### Identification of outlier attribute LT

In [None]:
sns.boxplot(x='LT', data=df);

Compute Q1, Q3, the interquartile range (IQR = Q3 - Q1), and the upper and lower limits (Q3 + 1.5 × IQR and Q1 - 1.5 × IQR); LT values outside these limits are outliers.

In [None]:
q12,q32=np.percentile(df['LT'], [25,75])
s2 = q32-q12
ba2 = q32+(1.5*s2)
bw2 = q12-(1.5*s2)
print(q12)
print(q32)
print(s2)
print(ba2)
print(bw2)

In [None]:
# shows outlier data on the LT attribute
dt2 = df[(df['LT']<bw2) | (df['LT']>ba2)]
dt2.head()

#### Identification of outlier attribute KT

In [None]:
sns.boxplot(x='KT', data=df);

Compute Q1, Q3, the interquartile range (IQR = Q3 - Q1), and the upper and lower limits (Q3 + 1.5 × IQR and Q1 - 1.5 × IQR); KT values outside these limits are outliers.

In [None]:
q13,q33=np.percentile(df['KT'], [25,75])
s3 = q33-q13
ba3 = q33+(1.5*s3)
bw3 = q13-(1.5*s3)
print(q13)
print(q33)
print(s3)
print(ba3)
print(bw3)

In [None]:
# shows outlier data on the KT attribute
dt3 = df[(df['KT']<bw3) | (df['KT']>ba3)]
dt3.head()

#### Identification of outlier attribute KM

In [None]:
sns.boxplot(x='KM', data=df);

Compute Q1, Q3, the interquartile range (IQR = Q3 - Q1), and the upper and lower limits (Q3 + 1.5 × IQR and Q1 - 1.5 × IQR); KM values outside these limits are outliers.

In [None]:
q14,q34=np.percentile(df['KM'], [25,75])
s4 = q34-q14
ba4 = q34+(1.5*s4)
bw4 = q14-(1.5*s4)
print(q14)
print(q34)
print(s4)
print(ba4)
print(bw4)

In [None]:
# shows outlier data on the KM attribute
dt4 = df[(df['KM']<bw4) | (df['KM']>ba4)]
dt4.head()

#### Identification of outlier attribute GRS

In [None]:
sns.boxplot(x='GRS', data=df);

Compute Q1, Q3, the interquartile range (IQR = Q3 - Q1), and the upper and lower limits (Q3 + 1.5 × IQR and Q1 - 1.5 × IQR); GRS values outside these limits are outliers.

In [None]:
q15,q35=np.percentile(df['GRS'], [25,75])
s5 = q35-q15
ba5 = q35+(1.5*s5)
bw5 = q15-(1.5*s5)
print(q15)
print(q35)
print(s5)
print(ba5)
print(bw5)

In [None]:
# shows outlier data on the GRS attribute
dt5 = df[(df['GRS']<bw5) | (df['GRS']>ba5)]
dt5.head()

#### Identification of outlier attribute Harga

In [None]:
sns.boxplot(x='HARGA', data=df);

Compute Q1, Q3, the interquartile range (IQR = Q3 - Q1), and the upper and lower limits (Q3 + 1.5 × IQR and Q1 - 1.5 × IQR); Harga values outside these limits are outliers.

In [None]:
q16,q36=np.percentile(df['HARGA'], [25,75])
s6 = q36-q16
ba6 = q36+(1.5*s6)
bw6 = q16-(1.5*s6)
print(q16)
print(q36)
print(s6)
print(ba6)
print(bw6)

In [None]:
# shows outlier data on the Harga attribute
dt6 = df[(df['HARGA']<bw6) | (df['HARGA']>ba6)]
dt6.head()
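
To compare the attributes side by side, the per-column limits and outlier counts can also be computed for the whole DataFrame in one pass. This is only a sketch on a small hypothetical frame (`demo` stands in for `df`); the `summary` layout is an assumption, not part of the original notebook.

```python
import pandas as pd

# Hypothetical stand-in for df with two of the notebook's columns
demo = pd.DataFrame({
    "LB": [90, 120, 150, 180, 200, 220, 250, 300, 1200],
    "KT": [2, 3, 3, 4, 4, 4, 5, 5, 5],
})

# Column-wise quartiles and interquartile range
q1 = demo.quantile(0.25)
q3 = demo.quantile(0.75)
iqr = q3 - q1

# Boolean mask: True where a value lies outside its column's limits
mask = (demo < q1 - 1.5 * iqr) | (demo > q3 + 1.5 * iqr)

summary = pd.DataFrame({
    "lower": q1 - 1.5 * iqr,
    "upper": q3 + 1.5 * iqr,
    "n_outliers": mask.sum(),
})
print(summary)
```

In this toy frame, LB has one value outside its limits and KT has none.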

### Data distribution

#### Distribution of data from attribute LB

In [None]:
import matplotlib.pyplot as plt

f = plt.figure(figsize=(12,4))

f.add_subplot(1,2,1)
df['LB'].plot(kind='kde')

f.add_subplot(1,2,2 )
plt.boxplot(df['LB'])

plt.show()

- Most building areas are concentrated around 200.
- The data contain many outliers.

#### Distribution of data from attribute LT

In [None]:
f = plt.figure(figsize=(12,4))

f.add_subplot(1,2,1)
df['LT'].plot(kind='kde')

f.add_subplot(1,2,2)
plt.boxplot(df['LT'])

plt.show()

- Most land areas are concentrated around 200.
- The data contain many outliers.

#### Distribution of data from attribute KT

In [None]:
f = plt.figure(figsize=(12,4))

f.add_subplot(1,2,1)
sns.countplot(x=df['KT'])

f.add_subplot(1,2,2)
plt.boxplot(df['KT'])

plt.show()

- Most houses have 4 or 5 bedrooms.
- The data contain few outliers.

#### Distribution of data from attribute KM

In [None]:
f = plt.figure(figsize=(12,4))

f.add_subplot(1,2,1)
sns.countplot(x=df['KM'])

f.add_subplot(1,2,2)
plt.boxplot(df['KM'])

plt.show()

- Most houses have 4 or 5 bathrooms.
- The data contain few outliers.

#### Distribution of data from attribute GRS

In [None]:
f = plt.figure(figsize=(12,4))

f.add_subplot(1,2,1)
sns.countplot(x=df['GRS'])

f.add_subplot(1,2,2)
plt.boxplot(df['GRS'])

plt.show();

- Most garages hold 1 or 2 cars.
- The data contain few outliers.

#### Distribution of data from attribute Harga

In [None]:
f = plt.figure(figsize=(12,4))

f.add_subplot(1,2,1)
df['HARGA'].plot(kind='kde')

f.add_subplot(1,2,2)
plt.boxplot(df['HARGA'])

plt.show()

### Correlation between independent and dependent variables

In [None]:
# Bivariate analysis between the independent variables and the dependent variable
# (pairplot creates its own figure, so no separate plt.figure() call is needed)
sns.pairplot(data=df, x_vars=['LT', 'LB', 'KM', 'KT', 'GRS'], y_vars=['HARGA'], height=5, aspect=0.75);

In [None]:
# correlation of independent variable and dependent variable
df.corr().style.background_gradient().format(precision=1)
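
When only the relationship with the target matters, the `HARGA` column of the correlation matrix can be extracted and sorted by absolute value. The sketch below uses synthetic data in which `LB` drives the price and `GRS` barely does; the numbers and the `demo` frame are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical data: HARGA depends strongly on LB and only weakly on GRS
rng = np.random.default_rng(1)
n = 200
lb = rng.uniform(50, 400, n)
grs = rng.integers(0, 4, n)
harga = 10.0 * lb + 50.0 * grs + rng.normal(0, 300, n)
demo = pd.DataFrame({"LB": lb, "GRS": grs, "HARGA": harga})

# Pearson correlation of every feature with the target, strongest first
corr_with_target = (
    demo.corr()["HARGA"]
    .drop("HARGA")
    .sort_values(key=abs, ascending=False)
)
print(corr_with_target)
```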

## **Statistical Tests**

In [None]:
x = df.drop(columns=['HARGA'])
y = df['HARGA']

In [None]:
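import statsmodels.api as sm

# First fit: no constant term, so the regression is forced through the origin
model = sm.OLS(y, x).fit()
predictions = model.predict(x)  # in-sample fitted values
model.summary()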

In [None]:
# adding a constant variable
X=sm.add_constant(x)
model=sm.OLS(y,X).fit()
model.summary()
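
The two fits above differ only in the constant column. A minimal NumPy sketch of why that matters: when the true relationship has a non-zero intercept, a regression through the origin compensates by distorting the slope. The data and coefficients below are synthetic assumptions, not estimates from the Tebet data.

```python
import numpy as np

# Synthetic data with a true intercept of 500 and a true slope of 10
rng = np.random.default_rng(2)
x = rng.uniform(50, 400, size=100)
y = 500.0 + 10.0 * x + rng.normal(0, 50, size=100)

# Without a constant: the line is forced through the origin
slope_no_const, *_ = np.linalg.lstsq(x[:, None], y, rcond=None)

# With a constant: a column of ones, as sm.add_constant would prepend
X = np.column_stack([np.ones_like(x), x])
(intercept, slope), *_ = np.linalg.lstsq(X, y, rcond=None)

print("slope without constant:", slope_no_const[0])
print("intercept and slope with constant:", intercept, slope)
```

The slope from the no-constant fit comes out noticeably above 10, while the fit with a constant recovers values near 500 and 10.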

### Normality test

For the normality test we use the Prob(JB) (Jarque-Bera) value from the regression summary above, which is 0.00, with the following hypotheses:

- Determine the Hypothesis
     - H0 : Residuals are normally distributed
     - H1 : Residuals are not normally distributed
- Significance level
     - α = 5% (α = 0.05)
- Test Statistics
     - p-value = 0.00
- Critical area
     - Reject H0 if p-value < α
- Decision
     - Because the p-value is 0.00 and 0.00 < 0.05 (p-value < α), reject H0.
- Conclusion
     - The residuals of this model are not normally distributed
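
For reference, the Jarque-Bera statistic combines the sample skewness S and kurtosis K of the residuals as JB = n/6 · (S² + (K − 3)²/4) and is compared with a chi-square distribution with 2 degrees of freedom. The sketch below computes it by hand on synthetic residuals; the function name and the sample data are assumptions for illustration, and the model's actual residuals would take the place of the synthetic arrays.

```python
import numpy as np

def jarque_bera(resid):
    """Return (JB statistic, p-value) for a 1-D array of residuals."""
    resid = np.asarray(resid, dtype=float)
    n = resid.size
    centered = resid - resid.mean()
    m2 = np.mean(centered ** 2)
    skew = np.mean(centered ** 3) / m2 ** 1.5
    kurt = np.mean(centered ** 4) / m2 ** 2
    jb = n / 6.0 * (skew ** 2 + (kurt - 3.0) ** 2 / 4.0)
    # Under H0, JB is asymptotically chi-square with 2 degrees of freedom,
    # whose survival function is exactly exp(-x / 2).
    p_value = np.exp(-jb / 2.0)
    return jb, p_value

rng = np.random.default_rng(3)
normal_resid = rng.normal(0, 1, size=500)
skewed_resid = rng.exponential(1.0, size=500)  # clearly non-normal

alpha = 0.05
for name, r in [("normal", normal_resid), ("skewed", skewed_resid)]:
    jb, p = jarque_bera(r)
    decision = "reject H0" if p < alpha else "fail to reject H0"
    print(name, round(jb, 2), round(p, 4), decision)
```

The strongly skewed sample yields a p-value far below 0.05, which mirrors the decision rule used above.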

### Multicollinearity test

In [None]:
from patsy import dmatrices
import statsmodels.formula.api as smf
from statsmodels.stats.outliers_influence import variance_inflation_factor

lm = smf.ols(formula = "HARGA~LB+LT+KT+KM+GRS", data = df).fit()
Y,X = dmatrices ("HARGA~LB+LT+KT+KM+GRS", data = df, return_type ="dataframe")
vif = [variance_inflation_factor(X.values, i) for i in range (X.shape[1])]
print(vif)

The multicollinearity test checks whether the independent variables in the multiple linear regression are correlated with one another.
- Determine the Hypothesis
     - H0 : VIF < 10 means there is no multicollinearity.
     - H1 : VIF > 10 means that there is multicollinearity.
- Threshold
     - VIF = 10 (a rule-of-thumb cutoff; no significance level α is involved)
- Test Statistics
     - VIF :
     - Constant = 10.500 (not interpreted, because the constant is not a predictor)
     - LB = 2.690
     - LT = 2.410
     - KT = 1.928
     - KM = 2.122
     - GRS = 2.122
- Critical Area
     - Reject H0 if VIF > 10
- Decision
     - Because the VIF values (LB = 2.690, LT = 2.410, KT = 1.928, KM = 2.122, and GRS = 2.122) are all < 10, we fail to reject H0
- Conclusion
     - So, the dataset does not show multicollinearity
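
The VIF of predictor j equals 1 / (1 − R_j²), where R_j² comes from regressing that predictor on all the others. The following NumPy sketch implements that definition on synthetic columns; the function name `vif` and the columns `a`, `b`, `c` are assumptions made for this example.

```python
import numpy as np

def vif(X):
    """VIF for each column of X (predictors only, no constant column)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = []
    for j in range(p):
        target = X[:, j]
        # Regress column j on a constant plus all remaining columns
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - resid @ resid / np.sum((target - target.mean()) ** 2)
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

rng = np.random.default_rng(4)
a = rng.normal(size=300)
b = rng.normal(size=300)              # independent of a
c = a + 0.1 * rng.normal(size=300)    # nearly a copy of a

v_indep = vif(np.column_stack([a, b]))
v_collinear = vif(np.column_stack([a, b, c]))
print(v_indep)       # both close to 1
print(v_collinear)   # a and c far above 10, b close to 1
```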

### Heteroscedasticity test

In [None]:
lm = smf.ols(formula="HARGA~LB+LT+KT+KM+GRS", data=df).fit()
resid = lm.resid

# Residuals versus fitted values
plt.scatter(lm.predict(), resid)

In [None]:
from statsmodels.stats.diagnostic import het_white

# Returns (LM statistic, LM p-value, F statistic, F p-value)
het_white(resid, lm.model.exog)

In the scatter plot of residuals against fitted values, the points form no clear pattern and are spread both above and below 0 on the Y axis. We therefore conclude that the regression model has no heteroscedasticity problem.
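
The visual check can be complemented by the numeric result of White's test, whose LM statistic is n · R² from an auxiliary regression of the squared residuals on the regressors, their squares, and their cross-products. The sketch below works this out by hand for the simplest case of a single predictor, where the statistic has 2 degrees of freedom; the function name and the synthetic data are assumptions, and it is not a substitute for `het_white` on the five-predictor model.

```python
import numpy as np

def white_test_one_predictor(x, resid):
    """White LM statistic and p-value when the model has a single predictor x."""
    x = np.asarray(x, dtype=float)
    u2 = np.asarray(resid, dtype=float) ** 2
    # Auxiliary regression of squared residuals on 1, x and x^2
    Z = np.column_stack([np.ones_like(x), x, x ** 2])
    coef, *_ = np.linalg.lstsq(Z, u2, rcond=None)
    fitted = Z @ coef
    r2 = 1.0 - np.sum((u2 - fitted) ** 2) / np.sum((u2 - u2.mean()) ** 2)
    lm_stat = x.size * r2
    # LM ~ chi-square with 2 degrees of freedom under H0; sf(x) = exp(-x / 2)
    return lm_stat, np.exp(-lm_stat / 2.0)

# Synthetic data whose noise standard deviation grows with x
rng = np.random.default_rng(5)
x = rng.uniform(1, 10, size=400)
y = 2.0 + 3.0 * x + rng.normal(0, 1, size=400) * x

X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid_demo = y - X @ beta

lm_stat, p_value = white_test_one_predictor(x, resid_demo)
print(lm_stat, p_value)   # a p-value below 0.05 signals heteroscedasticity
```

Here H0 is homoscedasticity, so a p-value above the chosen α would support the conclusion of no heteroscedasticity.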

## Partial test

The partial test (t-test) determines whether each independent variable (X) has a significant effect on the dependent variable (Y). From the regression summary above, the p-value of the constant is 0.064, the p-values of LB, LT, KT, and KM are 0.000, and the p-value of GRS is 0.001.
The hypotheses are:
- Hypothesis
     - H0 : βi = 0, i = 1,2,...,5 (Xi has no significant effect on Y)
     - H1 : βi ≠ 0, i = 1,2,...,5 (Xi has a significant effect on Y)
- Significance level
     - α = 5% = 0.05
- Test Statistics
     - p-value = 0.000 (LB, LT, KT, KM) and 0.001 (GRS)
- Critical area
     - Reject H0 if p-value ≤ α (0.05)
- Decision
     - Because the p-values for β1, β2, β3, β4, β5 are all < α, reject H0
- Conclusion
     - Each independent variable (LB, LT, KT, KM, GRS) has a significant effect on the dependent variable Y (Harga).
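
The p-values used in this test come from t-statistics of the form t = β̂ / se(β̂). The sketch below reproduces that calculation with NumPy on synthetic data, using a normal approximation for the two-sided p-value (reasonable for large samples, and an assumption of this example; statsmodels uses the exact t distribution). The variables `x1` and `x2` are hypothetical.

```python
import math
import numpy as np

rng = np.random.default_rng(6)
n = 300
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)                        # has no true effect on y
y = 1.0 + 2.0 * x1 + rng.normal(0, 1, size=n)

X = np.column_stack([np.ones(n), x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

resid = y - X @ beta
dof = n - X.shape[1]
sigma2 = resid @ resid / dof                   # unbiased error variance
cov_beta = sigma2 * np.linalg.inv(X.T @ X)     # covariance of the estimates
se = np.sqrt(np.diag(cov_beta))
t_stats = beta / se

# Two-sided p-values from the normal approximation to the t distribution
p_values = np.array([math.erfc(abs(t) / math.sqrt(2.0)) for t in t_stats])

for name, t, p in zip(["const", "x1", "x2"], t_stats, p_values):
    print(f"{name}: t = {t:.2f}, p = {p:.4f}")
```

The coefficient of `x1`, which truly affects `y`, gets a very large t-statistic and a p-value far below 0.05, so H0 would be rejected for it.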