# Data Science Project Setup Process

This notebook walks through the general process of setting up a data science project, so the same steps can be reused for future models.

## 1. Import and Analyze the Data
Our first step is to import the training (and test) data and start analyzing it. The questions to focus on are:
- Which features (columns) will be good for our model, and which are unneeded?
- How should we handle NaN values?
- Do we have categorical data? How should we handle it?
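
The questions above can be explored with a few lines of pandas. This is a minimal sketch, assuming a small made-up DataFrame (`train`, with a hypothetical `price` target) in place of a real training file that would normally be loaded with `pd.read_csv`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for a real training file.
train = pd.DataFrame({
    "rooms": [3, 4, np.nan, 2, 5],
    "area": [70.0, 95.5, 60.0, np.nan, 120.0],
    "city": ["Oslo", "Bergen", None, "Oslo", "Bergen"],
    "price": [200, 310, 150, 120, 400],
})

# Count missing values per column to decide how to handle NaNs.
missing_counts = train.isnull().sum()

# Separate the target from the candidate features.
y = train["price"]
X = train.drop(columns=["price"])

# Split the feature names into numeric and non-numeric (categorical) columns.
numerical_cols = X.select_dtypes(include="number").columns.tolist()
categorical_cols = X.select_dtypes(exclude="number").columns.tolist()

print(missing_counts)
print("numerical:", numerical_cols)
print("categorical:", categorical_cols)
```

The two column lists are what the preprocessing step later needs, so it is convenient to build them while exploring the data.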

## 2. Decide Preprocessing Steps
After analyzing our data and figuring out what we need to cut/change, we go forward with preprocessing. This includes:
- Imputing NaN values for both numeric and categorical columns
- Handling categorical data (dropping/one-hot encoding)
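
To see what these two steps do on their own, here is a small sketch using scikit-learn's `SimpleImputer` and `OneHotEncoder` on a made-up three-row DataFrame. The column names and the `"missing"` placeholder are assumptions chosen for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy features with one NaN in each column.
X = pd.DataFrame({
    "rooms": [3.0, np.nan, 5.0],
    "city": pd.Series(["Oslo", np.nan, "Bergen"], dtype="object"),
})

# Numeric column: replace NaN with the column mean, (3 + 5) / 2 = 4.
num_imputer = SimpleImputer(strategy="mean")
rooms_filled = num_imputer.fit_transform(X[["rooms"]])

# Categorical column: fill NaN with a placeholder label, then one-hot encode.
cat_imputer = SimpleImputer(strategy="constant", fill_value="missing")
city_filled = cat_imputer.fit_transform(X[["city"]])
encoder = OneHotEncoder(handle_unknown="ignore")
city_encoded = encoder.fit_transform(city_filled).toarray()

print(rooms_filled.ravel())
print(encoder.categories_)
print(city_encoded)
```

Filling categorical NaNs with a constant turns "missing" into its own category, which is why the encoded output has three columns rather than two.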

## 3. Set Up the Model
After analyzing our data and setting up preprocessing, it's time to set up our model: choose some starting parameters and decide which of them we will vary during optimization.
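
One way to see which parameters are available for tuning is to inspect the model's `get_params()` output. The sketch below assumes a `RandomForestRegressor`; the starting values and the list of candidates are illustrative, not recommendations:

```python
from sklearn.ensemble import RandomForestRegressor

# Starting parameters (illustrative values only).
model = RandomForestRegressor(n_estimators=100, random_state=0)

# Every constructor argument shows up here, including the defaults.
params = model.get_params()
print(sorted(params))

# Hypothetical candidate values to try later during optimization.
candidate_n_estimators = [50, 100, 200]
```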

## 4. Pipeline Setup
After figuring out our preprocessing and model setup, it's time to plug them into a pipeline to get ready for testing! Generally we will go with this setup:
```python
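from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# numerical_cols and categorical_cols are assumed to be lists of column
# names chosen while analyzing the data in step 1.

# Fill missing numeric values with the column mean
numerical_transformer = SimpleImputer(strategy='mean')

# Fill missing categorical values with a constant, then one-hot encode
categorical_transform = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),     # transform the numerical_cols with the numerical transformer
        ('cat', categorical_transform, categorical_cols)    # transform the categorical_cols with the categorical transformer
    ]
)

# Step 5: define our model
model = RandomForestRegressor(n_estimators=100, random_state=0)

# Set up our pipeline: preprocessing first, then the model
my_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', model)
])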
```

## 6. Training and Tweaking Time!
With the pipeline set up, it's time to start training and testing our model. Using cross-validation scores as feedback, this is when we start tweaking things to get better results. Some things to tweak include:
- Model hyperparameters (`n_estimators`, etc.)
- Model features
- Training size
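
The tweaking loop can be sketched with `cross_val_score`, scoring each candidate by mean absolute error. The data below is synthetic and the helper names `build_pipeline` and `get_score` are made up for this example; scikit-learn reports negative MAE, so the sign is flipped:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic data (an assumption for illustration): target depends on area.
rng = np.random.default_rng(0)
n = 60
X = pd.DataFrame({
    "area": rng.uniform(40, 150, n),
    "city": pd.Series(rng.choice(["Oslo", "Bergen"], n), dtype="object"),
})
y = 3 * X["area"] + rng.normal(0, 10, n)
X.loc[X.index % 10 == 0, "area"] = np.nan  # introduce some missing values


def build_pipeline(n_estimators):
    """Preprocessing plus a random forest with the given number of trees."""
    preprocessor = ColumnTransformer(transformers=[
        ("num", SimpleImputer(strategy="mean"), ["area"]),
        ("cat", Pipeline(steps=[
            ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
            ("onehot", OneHotEncoder(handle_unknown="ignore")),
        ]), ["city"]),
    ])
    model = RandomForestRegressor(n_estimators=n_estimators, random_state=0)
    return Pipeline(steps=[("preprocessor", preprocessor), ("model", model)])


def get_score(n_estimators):
    """Average MAE over 3 cross-validation folds (lower is better)."""
    scores = -1 * cross_val_score(
        build_pipeline(n_estimators), X, y,
        cv=3, scoring="neg_mean_absolute_error",
    )
    return scores.mean()


# Try a few candidate values and keep the one with the lowest MAE.
results = {n_est: get_score(n_est) for n_est in (10, 50, 100)}
best_n_estimators = min(results, key=results.get)
print(results)
print("best n_estimators:", best_n_estimators)
```

Because the preprocessing lives inside the pipeline, the imputers and encoder are refit on each training fold, which keeps validation data from leaking into the preprocessing.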

## 7. Make predictions and continue tweaking!
When we feel we've finally gotten a good MAE score and our model is looking good, it's time to start making final predictions! We can still occasionally tweak things if we feel the results could be better.
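
As a rough sketch of this last step, the example below checks MAE on a held-out validation split, refits on all labeled data, and predicts for unseen rows. The data, the `X_test` rows, and the `output` column names are all made up for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic labeled data (an assumption; real projects would load files).
rng = np.random.default_rng(0)
n = 80
X = pd.DataFrame({
    "area": rng.uniform(40, 150, n),
    "city": pd.Series(rng.choice(["Oslo", "Bergen"], n), dtype="object"),
})
y = 3 * X["area"] + rng.normal(0, 10, n)

preprocessor = ColumnTransformer(transformers=[
    ("num", SimpleImputer(strategy="mean"), ["area"]),
    ("cat", Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), ["city"]),
])
my_pipeline = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("model", RandomForestRegressor(n_estimators=100, random_state=0)),
])

# Hold out a validation split to get a final MAE estimate.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=0
)
my_pipeline.fit(X_train, y_train)
mae = mean_absolute_error(y_valid, my_pipeline.predict(X_valid))
print("validation MAE:", mae)

# Refit on all labeled data, then predict for unseen rows. The test rows
# include a NaN and an unseen city to show the preprocessing handling both.
X_test = pd.DataFrame({
    "area": [55.0, np.nan],
    "city": pd.Series(["Oslo", "Trondheim"], dtype="object"),
})
my_pipeline.fit(X, y)
predictions = my_pipeline.predict(X_test)
output = pd.DataFrame({"Id": X_test.index, "prediction": predictions})
print(output)
```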