<a href="https://colab.research.google.com/github/vipratiwari/Algorithems/blob/master/predict_car_price.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Problem Statement** - We want to build a machine learning model to predict the price of used cars based on various features such as car model, year of manufacture, mileage, and so on.

**Data Source** - We will use the publicly available Used Cars Dataset from Kaggle (`austinreese/craigslist-carstrucks-data`).

**Solution** - We will use Dask to load and preprocess the data in parallel, and Scikit-Learn to build a Random Forest Regression model to predict the price of used cars.

In [1]:
!pip install dask scikit-learn pandas


Load the dataset using Dask

In [None]:
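import dask.dataframe as dd

# Download dataset. Kaggle downloads require authentication, so a plain wget of
# the download URL fetches a login page rather than the data. This assumes the
# Kaggle CLI is installed and an API token is configured.
!kaggle datasets download -d austinreese/craigslist-carstrucks-data --unzip

# Load dataset with Dask (assumes the archive unpacks to vehicles.csv)
df = dd.read_csv('vehicles.csv')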

Preprocess the dataset

In [None]:
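# Drop irrelevant columns
df = df.drop(columns=['id', 'url', 'region_url', 'image_url', 'description', 'vin', 'county', 'state'])

# Convert categorical variables to dummy variables.
# dd.get_dummies needs categorical dtypes with known categories, so categorize first.
cat_cols = ['manufacturer', 'model', 'condition', 'cylinders', 'fuel', 'title_status', 'transmission', 'drive', 'size', 'type', 'paint_color']
df = df.categorize(columns=cat_cols)
df = dd.get_dummies(df, columns=cat_cols)

# Fill missing values with the mean of each column.
# The means are computed once so they reach fillna as plain pandas values.
# This assumes every remaining column is numeric; drop any other text columns above.
df = df.fillna(df.mean().compute())

# Split the dataset into training and test sets.
# Dask's random_split is used here in place of Scikit-Learn's train_test_split so that
# both parts stay lazy Dask DataFrames (the 80/20 split is approximate).
train, test = df.random_split([0.8, 0.2], random_state=42)
X_train, y_train = train.drop(columns=['price']), train['price']
X_test, y_test = test.drop(columns=['price']), test['price']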

Define a Scikit-Learn model

In [None]:
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor(n_estimators=100, random_state=42)

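Train the model. Calling `.compute()` materializes the Dask collections as in-memory pandas objects; the fitting itself runs in Scikit-Learn, not in Dask.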

In [None]:
model.fit(X_train.compute(), y_train.compute())

Evaluate the model on the test set

In [None]:
from sklearn.metrics import mean_squared_error
y_pred = model.predict(X_test.compute())
mse = mean_squared_error(y_test.compute(), y_pred)
print(f'Mean Squared Error: {mse:.2f}')
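
Mean squared error is expressed in squared dollars, which is hard to read directly. The sketch below, using hypothetical prices chosen for illustration, computes MSE by hand alongside two metrics in plain dollars: the root mean squared error (RMSE) and the mean absolute error (MAE).

```python
import math

# Hypothetical true and predicted prices in dollars, for illustration only.
y_true = [5000.0, 12000.0, 8000.0, 20000.0]
y_hat = [6000.0, 11000.0, 8000.0, 18000.0]

# Per-listing prediction errors.
errors = [pred - true for true, pred in zip(y_true, y_hat)]

# MSE averages the squared errors, so large misses dominate.
mse_toy = sum(e ** 2 for e in errors) / len(errors)

# RMSE brings the metric back to dollars; MAE is the average absolute miss.
rmse_toy = math.sqrt(mse_toy)
mae_toy = sum(abs(e) for e in errors) / len(errors)

print(f"MSE: {mse_toy:.2f}, RMSE: {rmse_toy:.2f}, MAE: {mae_toy:.2f}")
```

Here the RMSE (about 1224.74) is larger than the MAE (1000.00) because the single 2000-dollar miss is weighted more heavily once squared.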

That's it! This example uses Dask to load and preprocess a large dataset of used cars, and Scikit-Learn to build a Random Forest Regression model that predicts a car's price from features such as car model, year of manufacture, and mileage. Note that this dataset is just an example and the model may not be very accurate in practice. The number of trees and other hyperparameters can be tuned to improve the model's performance.
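
One way to tune the number of trees and other hyperparameters is a small cross-validated grid search. The sketch below is an assumption-laden illustration: it uses synthetic data from `make_regression` instead of the car listings, and the grid values are arbitrary starting points, not recommendations.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data; the real notebook would pass the computed training set here.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

# Hypothetical grid: two forest sizes and two depth limits.
param_grid = {"n_estimators": [20, 50], "max_depth": [None, 5]}

search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    scoring="neg_mean_squared_error",  # Scikit-Learn maximizes scores, so MSE is negated
    cv=3,
)
search.fit(X, y)

best_params = search.best_params_
print(best_params, -search.best_score_)
```

The negated `best_score_` is the cross-validated MSE of the best combination, which can be compared directly with the test-set MSE printed earlier.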