# Evaluation: Basic Modeling

In this task, we ask you to do basic model building. We took the dataset from a competition on [Kaggle](https://www.kaggle.com/austinreese/craigslist-carstrucks-data). Using the data provided, create a model of `price` based on the other column values.

## Column Descriptions

column | description
------ | --------
`id` | entry ID
`region` | craigslist region
`price` | price of car
`year` | entry year
`manufacturer` | manufacturer of vehicle
`model` | model of vehicle
`condition` | condition of vehicle
`cylinders` | number of cylinders
`fuel` | fuel type
`odometer` | miles traveled by vehicle
`title_status` | title status of vehicle
`transmission` | transmission of vehicle
`vin` | vehicle identification number
`drive` | type of drive
`size` | size of vehicle
`type` | generic type of vehicle
`paint_color` | color of vehicle
`description` | listed description of vehicle
`state` | state of listing
`lat` | latitude of listing
`long` | longitude of listing


## Relevant Midas Techniques

* **Splitting your data into train and test**: you can do `test_df, train_df = cars_df.split(100)`.
* **Sampling for faster interactions**: The original data may be too large to analyze interactively, and you may see a long delay. Instead, you can first sample a subset of the data for interactive analysis, e.g. `sample_train_df = train_df.sample(k=500)`, and then verify your results on the full dataset with static visualizations. You can record the query used by copying it out from the cell dropdown (📋), or by directly snapping the visualization (📷), which will contain the code to derive the data in a comment.
* **Data cleaning and feature engineering** by modifying or adding new processed columns, for instance the `condition` column.
* **Logging your insights**: often the EDA results will inform your model building. Please log your insights both for that, and for sharing with others.
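
The splitting, sampling, and cleaning steps above can be sketched in plain Pandas. This is only an illustration of the ideas, not the Midas API: the toy dataframe, its values, and the ordering used for `condition` are all assumptions made for the example.

```python
import pandas as pd

# Toy stand-in for the listings; the values are made up for illustration.
cars = pd.DataFrame({
    "price": [875, 11999, 19999, 5400, 7200, 15250, 3100, 9800],
    "odometer": [165000, None, 50838, 120000, 98000, 61000, 180000, 87000],
    "condition": ["salvage", None, "excellent", "good", "fair", "like new", "fair", "good"],
})

# Hold out a random quarter of the rows for testing; the rest is for training.
test = cars.sample(frac=0.25, random_state=0)
train = cars.drop(test.index)

# Work on a small random sample while exploring interactively.
sample_train = train.sample(n=3, random_state=0)

# Map `condition` onto an ordered numeric scale (hypothetical ordering);
# missing or unlisted conditions become NaN.
condition_rank = {"salvage": 0, "fair": 1, "good": 2, "excellent": 3, "like new": 4, "new": 5}
train = train.assign(condition_rank=train["condition"].map(condition_rank))
```

Fixing `random_state` keeps the split reproducible, which makes it easier to share results with others.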

```
from midas import Midas
m = Midas()

cars_df = m.from_file("./data/vehicles_5k.csv")
cars_df.head(3)
```

id | region | price | year | manufacturer | model | condition | cylinders | fuel | odometer | title_status | transmission | vin | drive | size | type | paint_color | description | state | lat | long | price_bin
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | ---
7033757953 | texoma | 875 | 2002 | ford | f-150 | salvage | | gas | 165000.0 | clean | automatic | | | | | black | 2002 Ford F-150. Side of the vehicle is damaged NO FRAME ... | ok | 33.5429 | -95.7416 | 0
7049364557 | ft myers / SW florida | 11999 | 2000 | ford | f250 | | | diesel | | clean | automatic | | | | | | Clean Title 37s Cold AC Runs Great 196k 7.3 Diesel Best ... | fl | 26.6204 | -81.8725 | 10000
7046136964 | syracuse | 19999 | 2015 | infiniti | qx60 | excellent | 6 cylinders | gas | 50838.0 | clean | automatic | 5N1AL0MMXFC503546 | 4wd | | SUV | white | - 2015 Infiniti QX60 3.5-L V-6 DOHC 24V - Only 50838k on ... | ny | 43.0847 | -76.2405 | 18000


## Task One: Exploratory Data Analysis (10 minutes)

Please get a basic sense of the data and write down any relevant insights. <font color="gray">E.g., only X percent of the listings report a `condition`, or most of the vehicles use fuel type Y.</font>

Please treat this document as a resource to be shared with your (imaginary) team. You can use either comments or markdown cells to report your insights.
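
As a rough idea of the kind of numbers worth logging, here is a minimal Pandas sketch. It assumes you have converted the data to a Pandas dataframe; the toy values are made up and stand in for the real listings.

```python
import pandas as pd

# Toy stand-in for the listings; the values are made up for illustration.
cars = pd.DataFrame({
    "price": [875, 11999, 19999, 5400],
    "fuel": ["gas", "diesel", "gas", "gas"],
    "condition": ["salvage", None, "excellent", None],
})

# Share of missing values per column, as a percentage.
missing_pct = cars.isna().mean() * 100

# Most common categories in a column, as proportions.
fuel_share = cars["fuel"].value_counts(normalize=True)

# Summary statistics for the target column.
price_summary = cars["price"].describe()
```

Columns with a high share of missing values are usually poor candidates for features unless they are cleaned or imputed first.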

## Task Two: Build a Basic Model (30 minutes)

Please try to model the `price` of the car. We ask you to come up with an explainable model based on the exploration you just performed. Please try to limit the number of features in your final model. If you have any intuitions about why a model is not working, please record those as well.

* Feel free to use `sklearn` or whatever library you are comfortable with, e.g., `from sklearn.linear_model import LinearRegression, LogisticRegression`.
* Feel free to pass a Pandas dataframe into the library by calling `train_df.to_df()`, but we ask you to do the data manipulation in our dataframe language.
* You might need to do some feature engineering, e.g., turning categorical variables into numeric values or one-hot encodings.
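
For the one-hot encoding mentioned above, a minimal sketch in Pandas could look like the following. It assumes the data has already been converted with `to_df()`; the toy frame here is made up for illustration.

```python
import pandas as pd

# Toy stand-in for a converted dataframe; the values are made up.
toy = pd.DataFrame({"fuel": ["gas", "diesel", "gas"], "year": [2002, 2000, 2015]})

# Replace `fuel` with 0/1 indicator columns. `drop_first=True` drops one
# category (here `diesel`) so the indicators are not perfectly collinear
# with the intercept of a linear model.
encoded = pd.get_dummies(toy, columns=["fuel"], drop_first=True, dtype=int)
```

The dropped category becomes the baseline: the coefficient on `fuel_gas` is then read as the price difference relative to diesel.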

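Here is some sample stub code to get you started:
```
from sklearn.linear_model import LinearRegression

# `cleaned` is assumed to be your cleaned Midas dataframe (it is not defined in this stub).
reg = LinearRegression()
y = cleaned.select(['price'])              # target column
X = cleaned.select(['odometer', 'year'])   # feature columns
# Convert to Pandas dataframes before handing the data to sklearn.
reg.fit(X.to_df(), y.to_df())
```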
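
Once a model is fitted, one way to check it and keep it explainable is to look at its coefficients and its score on held-out rows. The sketch below uses plain Pandas frames with made-up values as stand-ins for your converted train and test data; the variable names are illustrative only.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy stand-ins for the train and test frames; the values are made up.
train = pd.DataFrame({
    "odometer": [165000, 50838, 120000, 98000, 61000, 180000],
    "year": [2002, 2015, 2008, 2010, 2014, 2001],
    "price": [875, 19999, 5400, 7200, 15250, 3100],
})
test = pd.DataFrame({
    "odometer": [87000, 140000, 30000],
    "year": [2011, 2005, 2017],
    "price": [9800, 4100, 23500],
})

features = ["odometer", "year"]
reg = LinearRegression().fit(train[features], train["price"])

# One coefficient per feature: the predicted change in price
# for a one-unit increase in that feature, holding the other fixed.
coefficients = dict(zip(features, reg.coef_))

# R^2 on rows the model has not seen during fitting.
test_r2 = reg.score(test[features], test["price"])
```

A large gap between the training score and `test_r2` is a hint that the model is overfitting or that the features need more cleaning.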