### *Let's load and preprocess the housing dataset, which contains the columns "Avg. Area Income," "Avg. Area House Age," "Avg. Area Number of Rooms," "Avg. Area Number of Bedrooms," "Area Population," "Price," and "Address." This section focuses on the data preprocessing steps.*


## Import Libraries:
First, import the libraries needed for data manipulation and analysis: pandas to load the dataset, NumPy for numerical operations, and scikit-learn's `train_test_split` and `StandardScaler` for the splitting and scaling steps later in this section.

In [24]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

## Load the Dataset:
Load the housing dataset from the given file location and preview the first rows:

In [25]:
dataset_path = "/USA_Housing.csv"  # Replace with the actual path to the CSV file
df = pd.read_csv(dataset_path)
print(df.head())

   Avg. Area Income  Avg. Area House Age  Avg. Area Number of Rooms  \
0      79545.458574             5.682861                   7.009188   
1      79248.642455             6.002900                   6.730821   
2      61287.067179             5.865890                   8.512727   
3      63345.240046             7.188236                   5.586729   
4      59982.197226             5.040555                   7.839388   

   Avg. Area Number of Bedrooms  Area Population         Price  \
0                          4.09     23086.800503  1.059034e+06   
1                          3.09     40173.072174  1.505891e+06   
2                          5.13     36882.159400  1.058988e+06   
3                          3.26     34310.242831  1.260617e+06   
4                          4.23     26354.109472  6.309435e+05   

                                             Address  
0  208 Michael Ferry Apt. 674\nLaurabury, NE 3701...  
1  188 Johnson Views Suite 079\nLake Kathleen, CA...  
2  9127 Eli

## Data Preprocessing:
With the data loaded, the preprocessing steps for this dataset are:

   - **Handling Missing Values:**
     Check for missing values, and if there are any, decide how to handle them. In this example, we'll remove rows with missing data.

In [23]:
df.dropna(inplace=True)  # Remove rows with missing values
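
Before dropping rows, it can help to see how many values are actually missing in each column. The sketch below shows that check on a small made-up frame; `toy` and its values are hypothetical stand-ins, not rows from the housing dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the housing data (values are made up).
toy = pd.DataFrame({
    "Avg. Area Income": [60000.0, np.nan, 75000.0, 82000.0],
    "Area Population": [30000.0, 41000.0, np.nan, 25000.0],
    "Price": [1.0e6, 1.2e6, 9.0e5, 1.5e6],
})

# Count missing values per column before deciding how to handle them.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Drop every row that has at least one missing value and report how many were lost.
cleaned = toy.dropna()
rows_dropped = len(toy) - len(cleaned)
print("rows dropped:", rows_dropped)
```

If the count of dropped rows is a large share of the data, imputing values may be a better choice than removing rows.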

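- **Encoding Categorical Variables:**
     The "Address" column is the only non-numeric column. As the preview above shows, it holds free-text street addresses rather than a small set of categories, so there is nothing practical to encode. We drop it, since it is unlikely to provide meaningful information for price prediction.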


In [12]:
df = df.drop("Address", axis=1)
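
One way to check whether a text column behaves like an identifier rather than a category is to compare its number of distinct values with the number of rows. This is a minimal sketch on made-up data; the `Region` column and the `ID_THRESHOLD` cutoff are hypothetical and only illustrate the idea.

```python
import pandas as pd

# Hypothetical toy frame: "Address" is unique per row, "Region" repeats.
toy = pd.DataFrame({
    "Address": ["1 Example St", "2 Example St", "3 Example St", "4 Example St"],
    "Region": ["north", "south", "north", "south"],
})

# Fraction of distinct values per column: 1.0 means every row is unique.
unique_ratio = toy.nunique() / len(toy)
print(unique_ratio)

# Hypothetical cutoff: columns above it are treated as identifiers, not categories.
ID_THRESHOLD = 0.9
id_like = [col for col in toy.columns if unique_ratio[col] > ID_THRESHOLD]

# Drop the identifier-like columns and keep the rest for encoding.
reduced = toy.drop(columns=id_like)
print(list(reduced.columns))
```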

 - **Splitting the Data:**
     Split the dataset into features (X) and the target variable (y). In this case, "Price" is our target variable, and the rest of the columns are features.

In [17]:
X = df.drop("Price", axis=1)
y = df["Price"]
print(X.head())

   Avg. Area Income  Avg. Area House Age  Avg. Area Number of Rooms  \
0      79545.458574             5.682861                   7.009188   
1      79248.642455             6.002900                   6.730821   
2      61287.067179             5.865890                   8.512727   
3      63345.240046             7.188236                   5.586729   
4      59982.197226             5.040555                   7.839388   

   Avg. Area Number of Bedrooms  Area Population  
0                          4.09     23086.800503  
1                          3.09     40173.072174  
2                          5.13     36882.159400  
3                          3.26     34310.242831  
4                          4.23     26354.109472  


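  - **Feature Scaling:**
     It's good practice to scale the numerical features so that columns with large ranges (such as income and population) do not dominate scale-sensitive models. We'll use StandardScaler, which rescales each feature to zero mean and unit variance. Note that the scaler here is fitted on the full dataset before the train-test split, so test-set statistics influence the scaling; for a stricter evaluation, fit the scaler on the training split only.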


In [19]:
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
print(X_scaled)

[[ 1.02865969 -0.29692705  0.02127433  0.08806222 -1.31759867]
 [ 1.00080775  0.02590164 -0.25550611 -0.72230146  0.40399945]
 [-0.68462916 -0.11230283  1.5162435   0.93084045  0.07240989]
 ...
 [-0.48723454  1.28447022 -2.17026949 -1.50025059 -0.29193658]
 [-0.05459152 -0.44669439  0.14154061  1.18205319  0.65111608]
 [-0.28831272  0.01521477 -0.19434166  0.07185495  1.04162464]]
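
To make the scaled values less opaque, the sketch below reproduces what `StandardScaler` does by hand: subtract each column's mean and divide by its population standard deviation. The array `X_toy` is made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy features with very different ranges per column.
X_toy = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0], [4.0, 400.0]])

toy_scaler = StandardScaler()
X_toy_scaled = toy_scaler.fit_transform(X_toy)

# Manual z-score: np.std uses ddof=0 by default, matching StandardScaler.
manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

print(np.allclose(X_toy_scaled, manual))
print(X_toy_scaled.mean(axis=0), X_toy_scaled.std(axis=0))
```

After scaling, each column has a mean of approximately 0 and a standard deviation of approximately 1, regardless of its original units.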


- **Splitting the Data into Training and Testing Sets:**
     Split the data into training and testing sets for model evaluation.


In [15]:
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

To confirm the split, print the shapes of `X_train`, `X_test`, `y_train`, and `y_test`. The cell below repeats the `train_test_split` call and then reports the size of each set:


In [20]:
from sklearn.model_selection import train_test_split

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

# Print the shapes of the resulting sets
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)


X_train shape: (4000, 5)
X_test shape: (1000, 5)
y_train shape: (4000,)
y_test shape: (1000,)
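
The workflow above fits the scaler on all rows before splitting. A common alternative, sketched here on synthetic data, is to split first and fit the scaler on the training rows only, so that no test-set statistics reach the training step. The names `X_demo`, `y_demo`, and `demo_scaler` are hypothetical, and the random data is not the housing dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 100 rows, 3 features (not the housing dataset).
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
y_demo = rng.normal(size=100)

# Split first, using the same 80/20 ratio and seed as above.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Fit the scaler on the training rows only, then reuse it for the test rows.
demo_scaler = StandardScaler()
X_tr_scaled = demo_scaler.fit_transform(X_tr)
X_te_scaled = demo_scaler.transform(X_te)

print("train:", X_tr_scaled.shape, "test:", X_te_scaled.shape)
```

With this ordering the training columns have zero mean exactly, while the test columns are only approximately centered, which is expected because the test rows did not contribute to the fitted statistics.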
