Task: Write Python code to solve the Kaggle Boston Housing regression problem, including a web crawler that downloads the dataset from https://gist.githubusercontent.com/GaneshSparkz/b5662effbdae8746f7f7d8ed70c42b2d/raw/faf8b1a0d58e251f48a647d3881e7a960c3f0925/50_Startups.csv

Follow the steps below:

Step 1: Import the necessary libraries for data manipulation and model building.

Step 2: Load the dataset from the 'train.csv' file and display the first few rows to explore the data.

Step 3: Preprocess the data by separating the features (X) from the target variable (y), and use a one-hot encoder to handle the "State" column. Split the data into training and testing sets using the train_test_split function with a test size of 20% and a random state of 42.

Step 4: Train a Lasso linear regression model on the training data (X_train and y_train) and list the MSE of the best model for each number of variables. Make a table listing the results and the names of the variables.

Step 5: Make predictions on the testing data (X_test) using the trained model. Calculate the mean squared error (MSE) between the predicted values and the actual target values (y_test) using the mean_squared_error function from sklearn.metrics. Print the calculated MSE.

Note: Replace 'train.csv' with the actual dataset file name.
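As an illustration of Step 3, here is a minimal sketch of one-hot encoding followed by an 80/20 split. It runs on a hypothetical five-row toy frame whose values are made up, and it uses `pandas.get_dummies` as a compact stand-in for scikit-learn's `OneHotEncoder`; only the column names are borrowed from the task.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy rows that mimic the dataset's column layout (values are made up).
toy = pd.DataFrame({
    "R&D Spend": [100.0, 200.0, 300.0, 400.0, 500.0],
    "State": ["New York", "California", "Florida", "New York", "Florida"],
    "Profit": [10.0, 20.0, 30.0, 40.0, 50.0],
})

# Replace "State" with one indicator column per distinct value.
encoded = pd.get_dummies(toy, columns=["State"], dtype=float)

# Separate the features from the target.
X = encoded.drop(columns=["Profit"])
y = encoded["Profit"]

# Hold out 20% of the rows for testing, with a fixed seed for repeatability.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(list(X.columns))
print(X_train.shape, X_test.shape)
```

Each row ends up with exactly one indicator set to 1.0, so the three `State_*` columns together carry the same information as the original text column.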




In [3]:
# Step 1: Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import OneHotEncoder

# Step 2: Load the dataset from the URL
url = "https://gist.githubusercontent.com/GaneshSparkz/b5662effbdae8746f7f7d8ed70c42b2d/raw/faf8b1a0d58e251f48a647d3881e7a960c3f0925/50_Startups.csv"
data = pd.read_csv(url)

# Display the first few rows to explore the data
print(data.head())

# Step 3: Preprocess the data
# OneHotEncoder accepts string categories directly, so no LabelEncoder pass is needed.
# Calling .toarray() on the default sparse result avoids the `sparse` keyword,
# which newer scikit-learn releases replaced with `sparse_output`.
onehot_encoder = OneHotEncoder()
encoded_state = onehot_encoder.fit_transform(data[['State']]).toarray()
# Name each indicator column after the category it encodes
state_columns = ['State_' + str(c) for c in onehot_encoder.categories_[0]]
state_df = pd.DataFrame(encoded_state, columns=state_columns, index=data.index)
data = pd.concat([data.drop(columns=['State']), state_df], axis=1)

X = data.drop(['Profit'], axis=1)
y = data['Profit']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 4: Train a Lasso model on the first k columns of X, for k = 1 up to the number of columns
results = []

for num_features in range(1, X_train.shape[1] + 1):
    model = Lasso(alpha=1.0)  # alpha sets the L1 penalty strength; adjust as needed
    model.fit(X_train.iloc[:, :num_features], y_train)
    y_pred = model.predict(X_test.iloc[:, :num_features])
    mse = mean_squared_error(y_test, y_pred)
    results.append((num_features, X_train.columns[:num_features].tolist(), mse))

# Create a DataFrame to display the results
result_df = pd.DataFrame(results, columns=['Num Features', 'Selected Features', 'MSE'])

# Step 5: Pick the feature set with the lowest test MSE and report it
best_model = result_df.loc[result_df['MSE'].idxmin()]
best_num_features = best_model['Num Features']
best_features = best_model['Selected Features']
best_mse = best_model['MSE']

print("Results for Different Numbers of Variables:")
print(result_df)

print("\nBest Model (Minimum MSE):")
print(f"Number of Features: {best_num_features}")
print(f"Selected Features: {best_features}")
print(f"Best MSE: {best_mse}")

   R&D Spend  Administration  Marketing Spend       State     Profit
0  165349.20       136897.80        471784.10    New York  192261.83
1  162597.70       151377.59        443898.53  California  191792.06
2  153441.51       101145.55        407934.54     Florida  191050.39
3  144372.41       118671.85        383199.62    New York  182901.99
4  142107.34        91391.77        366168.42     Florida  166187.94
Results for Different Numbers of Variables:
   Num Features                                  Selected Features  \
0             1                                        [R&D Spend]   
1             2                        [R&D Spend, Administration]   
2             3       [R&D Spend, Administration, Marketing Spend]   
3             4  [R&D Spend, Administration, Marketing Spend, S...   
4             5  [R&D Spend, Administration, Marketing Spend, S...   
5             6  [R&D Spend, Administration, Marketing Spend, S...   

            MSE  
0  5.951096e+07  
1  8.376413e+07
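The cell above holds alpha fixed and varies the number of input columns by hand. Another common reading of "different numbers of variables" with Lasso is to vary alpha and let the L1 penalty decide which coefficients drop to zero. The sketch below illustrates that idea on synthetic data; the feature names `f0` to `f4`, the alpha grid, and the data itself are assumptions made for the example, not values from the dataset. It also prints the table with `to_string` so the list of selected variables is not truncated.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (made up): only f0 and f1 drive the target.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 5)), columns=[f"f{i}" for i in range(5)])
y = 3.0 * X["f0"] - 2.0 * X["f1"] + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Standardize so the penalty treats every feature on the same scale.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

rows = []
for alpha in [0.001, 0.01, 0.1, 1.0, 5.0]:  # hypothetical grid
    model = Lasso(alpha=alpha, max_iter=10_000).fit(X_train_s, y_train)
    # Variables whose coefficient survived the L1 penalty.
    kept = [name for name, coef in zip(X.columns, model.coef_) if coef != 0]
    mse = mean_squared_error(y_test, model.predict(X_test_s))
    rows.append((alpha, len(kept), kept, mse))

table = pd.DataFrame(rows, columns=["alpha", "Num Features", "Selected Features", "MSE"])
# to_string prints every column in full instead of eliding long lists.
print(table.to_string(index=False))
```

Larger alpha values keep fewer variables, so each row of the table pairs a variable count with the names that survived and the resulting test MSE. Choosing alpha on the test set, as done here for brevity, gives an optimistic error estimate; a separate validation split or cross-validation would be the more careful choice.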

