In this project, we'll predict house prices from a dataset using Python, Pandas, Scikit-learn, and Spark. The procedure is outlined below:

Collect the dataset
Explore and preprocess the data
Train a machine learning model
Evaluate the model
Save the model

1. Collect the dataset

We'll use the Boston Housing dataset for this project. It's a well-known dataset that describes housing in the Boston area, with features such as the crime rate and the average number of rooms per dwelling. You can find the dataset in the UCI Machine Learning Repository or load it from Scikit-learn with load_boston. Note that load_boston was deprecated in Scikit-learn 1.0 and removed in 1.2, so loading the data this way requires an older version.

2. Explore and preprocess the data

First, let's import the necessary libraries and load the dataset:

In [1]:
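import pandas as pd
import numpy as np
# load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this import only works with older versions.
from sklearn.datasets import load_boston

# Load the features into a DataFrame and attach the target as the PRICE column
boston = load_boston()
data = pd.DataFrame(boston.data, columns=boston.feature_names)
data['PRICE'] = boston.target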



    The Boston housing prices dataset has an ethical problem. You can refer to
    the documentation of this function for further details.

    The scikit-learn maintainers therefore strongly discourage the use of this
    dataset unless the purpose of the code is to study and educate about
    ethical issues in data science and machine learning.

    In this special case, you can fetch the dataset from the original
    source::

        import pandas as pd
        import numpy as np

        data_url = "http://lib.stat.cmu.edu/datasets/boston"
        raw_df = pd.read_csv(data_url, sep="\s+", skiprows=22, header=None)
        data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
        target = raw_df.values[1::2, 2]

    Alternative datasets include the California housing dataset (i.e.
    :func:`~sklearn.datasets.fetch_california_housing`) and the Ames housing
    dataset. You can load the datasets as follows::

        from sklearn.datasets import fetch_california_ho

Now, you can explore the data using methods like data.head() and data.describe(). To preprocess the data, you can apply techniques like feature scaling, encoding categorical variables, and splitting the dataset into training and testing sets.
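
As a rough illustration of the exploration step, the sketch below runs head(), describe(), a missing-value count, and a correlation check. The rows are synthetic stand-ins invented for this example (they are not values from the Boston dataset); only the column names CRIM, RM, and PRICE mirror the ones used in this walkthrough.

```python
import pandas as pd

# Synthetic stand-in rows (NOT real Boston data); column names mirror
# three of the columns used in this walkthrough.
data = pd.DataFrame({
    "CRIM": [0.02, 0.10, 0.35, 1.20],
    "RM": [6.5, 6.1, None, 5.4],
    "PRICE": [24.0, 21.5, 19.0, 14.5],
})

# First rows: a quick look at column names and typical values
print(data.head())

# Summary statistics (count, mean, std, min, quartiles, max) per column
summary = data.describe()
print(summary)

# Missing values per column; non-zero counts call for imputation or row removal
missing = data.isna().sum()
print(missing)

# Correlation of each feature with the target, a rough guide to relevance
correlations = data.corr()["PRICE"].drop("PRICE")
print(correlations)
```

A missing-value count above zero (as for RM here) is a cue to impute or drop rows before scaling, since StandardScaler ignores NaN when fitting but passes it through to the output.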

In [2]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Splitting the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data.drop('PRICE', axis=1), data['PRICE'], test_size=0.3, random_state=42
)

# Scaling the features: fit the scaler on the training set only, so that
# test-set statistics do not leak into preprocessing
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
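
One way to keep scaling and model fitting tied together is Scikit-learn's Pipeline, which fits the scaler only on the data passed to fit(). The sketch below is an optional variation, not part of the original notebook, and it uses synthetic data so that it runs without the housing dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (an assumption for illustration only)
rng = np.random.default_rng(42)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# The pipeline scales, then regresses; fit() sees only the training rows
pipeline = make_pipeline(StandardScaler(), LinearRegression())
pipeline.fit(X_train, y_train)

# The scaler's statistics come from the training set alone
fitted_scaler = pipeline.named_steps["standardscaler"]
print(np.allclose(fitted_scaler.mean_, X_train.mean(axis=0)))

# score() applies the same scaling to X_test before computing R-squared
test_r2 = pipeline.score(X_test, y_test)
print(f"R-squared on held-out data: {test_r2:.3f}")
```

With a pipeline, a single object carries both the preprocessing and the model, which also simplifies saving and reloading later.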


3. Train a machine learning model

Now, let's train a simple linear regression model using Scikit-learn:

In [3]:
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train, y_train)
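
To see what fit() actually learns, it can help to train on data where the true relationship is known. The sketch below assumes synthetic data generated from y = 3*x0 - 2*x1 + 5 plus a little noise (none of this comes from the housing dataset) and checks that coef_ and intercept_ recover those values.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with a known linear relationship (illustrative assumption)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 5.0 + rng.normal(scale=0.01, size=200)

model = LinearRegression()
model.fit(X, y)

# One learned weight per feature, plus a single intercept term
print(model.coef_)       # close to [3, -2]
print(model.intercept_)  # close to 5
```

On the housing data, the same attributes show how much the predicted price changes per unit change in each (scaled) feature.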


4. Evaluate the model

Evaluate the performance of the model using metrics like Mean Squared Error (MSE) and R-squared:

In [4]:
from sklearn.metrics import mean_squared_error, r2_score

y_pred = model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f'Mean Squared Error: {mse:.2f}')
print(f'R-squared: {r2:.2f}')


Mean Squared Error: 21.52
R-squared: 0.71
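
For intuition about what these two metrics measure, the sketch below computes them by hand with NumPy on four made-up values and compares the results with Scikit-learn's functions. The arrays y_true and y_hat are illustrative only.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up targets and predictions, for illustration only
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.0, 8.0, 9.5])

# MSE: the mean of the squared residuals
mse_manual = np.mean((y_true - y_hat) ** 2)

# R-squared: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

# Both hand-computed values agree with scikit-learn
print(np.isclose(mse_manual, mean_squared_error(y_true, y_hat)))
print(np.isclose(r2_manual, r2_score(y_true, y_hat)))
print(f"MSE: {mse_manual:.3f}, R-squared: {r2_manual:.3f}")  # MSE: 0.375, R-squared: 0.925
```

Because MSE is in squared units of the target, its square root (RMSE) is often easier to interpret: it is expressed in the same units as PRICE.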


5. Save the model

Finally, save the trained model and scaler for future use:

In [5]:
import joblib

joblib.dump(model, 'linear_regression_model.pkl')
joblib.dump(scaler, 'scaler.pkl')


['scaler.pkl']
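
The saved files are only useful if they can be restored and applied to new data. The sketch below shows a possible round trip with joblib.load; it trains on synthetic data and writes to a temporary directory (both are assumptions made so the example is self-contained), and it reuses the file names from the cell above.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training data (not the housing dataset)
rng = np.random.default_rng(0)
X = rng.normal(loc=10.0, scale=3.0, size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 4.0

scaler = StandardScaler().fit(X)
model = LinearRegression().fit(scaler.transform(X), y)

with tempfile.TemporaryDirectory() as tmp:
    model_path = os.path.join(tmp, "linear_regression_model.pkl")
    scaler_path = os.path.join(tmp, "scaler.pkl")
    joblib.dump(model, model_path)
    joblib.dump(scaler, scaler_path)

    # Later, for example in a separate process: restore both objects
    loaded_model = joblib.load(model_path)
    loaded_scaler = joblib.load(scaler_path)

# New raw samples must pass through the same scaler before prediction
new_samples = X[:5]
original_pred = model.predict(scaler.transform(new_samples))
restored_pred = loaded_model.predict(loaded_scaler.transform(new_samples))
print(np.allclose(original_pred, restored_pred))  # True
```

Two cautions apply: pickle-based files should only be loaded from trusted sources, and they are best loaded with the same Scikit-learn version that produced them.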

This project should give you a solid foundation for demonstrating your skills with Python, Pandas, Scikit-learn, and machine learning. You can extend this project by adding more advanced models, feature engineering, or by deploying it as a web service using Flask or FastAPI.

First, ensure you have FastAPI and Uvicorn installed. You can install them using the following command:

In [6]:
pip install fastapi uvicorn


Collecting fastapi
  Downloading fastapi-0.95.1-py3-none-any.whl (56 kB)
     -------------------------------------- 57.0/57.0 kB 599.1 kB/s eta 0:00:00
Note: you may need to restart the kernel to use updated packages.

Collecting uvicorn
  Downloading uvicorn-0.21.1-py3-none-any.whl (57 kB)
     ---------------------------------------- 57.8/57.8 kB 3.2 MB/s eta 0:00:00
Collecting pydantic!=1.7,!=1.7.1,!=1.7.2,!=1.7.3,!=1.8,!=1.8.1,<2.0.0,>=1.6.2
  Downloading pydantic-1.10.7-cp39-cp39-win_amd64.whl (2.2 MB)
     ---------------------------------------- 2.2/2.2 MB 3.3 MB/s eta 0:00:00
Collecting starlette<0.27.0,>=0.26.1
  Downloading starlette-0.26.1-py3-none-any.whl (66 kB)
     ---------------------------------------- 66.9/66.9 kB 3.5 MB/s eta 0:00:00
Collecting h11>=0.8
  Downloading h11-0.14.0-py3-none-any.whl (58 kB)
     ---------------------------------------- 58.3/58.3 kB 3.0 MB/s eta 0:00:00
Collecting typing-extensions>=4.2.0
  Downloading typing_extensions-4.5.0-py3-none-any

For reference, the training steps above can be combined into a single script:

In [7]:
import pandas as pd
import numpy as np
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import joblib

# Load the dataset (load_boston was removed in scikit-learn 1.2,
# so this requires an older version)
boston = load_boston()
data = pd.DataFrame(boston.data, columns=boston.feature_names)
data['PRICE'] = boston.target

# Split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data.drop('PRICE', axis=1), data['PRICE'], test_size=0.3, random_state=42
)

# Scale the features, fitting the scaler on the training set only
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Train the model
model = LinearRegression()
model.fit(X_train, y_train)

# Evaluate the model
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f'Mean Squared Error: {mse:.2f}')
print(f'R-squared: {r2:.2f}')

# Save the model and scaler
joblib.dump(model, 'linear_regression_model.pkl')
joblib.dump(scaler, 'scaler1.pkl')
