# Data Science: Analysis, Visualization, and Machine Learning with Python

This notebook showcases the skills in Data Science that I've enhanced and acquired with the help of IBM's Cognitive Classes.  
The workflow processes and examples in this notebook are accurate to the style that I implement in a professional setting.

# Data Analysis: The Prices of Cars

Using Data Analysis, we can figure out the max price a car can be realistically sold for.  
The dataset we will use to aid our analysis is an open dataset named **"1985 Auto Imports Database"**, provided by Jeffrey C. Schlimmer.  
The data can be found at https://archive.ics.uci.edu/ml/machine-learning-databases/autos, under the filename **"imports-85.data"**.

Below is a sample of the dataset viewed in a text editor:

![test_title1](Picture_of_Cars_Dataset.png)

We can see that this dataset is structured in CSV (comma-separated values) format.  
Each row represents a new observation, while each comma separates one column (or variable) from the next.  
This is a highly common data structure that is incredibly easy to import and work with.
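
As a minimal sketch of that row-and-column structure, the snippet below parses two shortened rows with Python's built-in `csv` module. The `sample` string is an illustrative assumption (only the first four fields of two rows), not the full file.

```python
import csv
import io

# Two shortened rows in the same comma-separated layout as the dataset
# (illustrative sample; the real file has 26 fields per row).
sample = "3,?,alfa-romero,gas\n2,164,audi,gas\n"

# csv.reader yields one list per row (observation);
# each item in the list is the value of one column (variable).
rows = list(csv.reader(io.StringIO(sample)))

for row in rows:
    print(len(row), row)
```

Every value arrives as text at this stage, which is why a library such as Pandas has to infer data types on import.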

There is also another helpful file in the open directory named **"imports-85.names"**, which contains further details about the dataset.  
IBM Cognitive Class was kind enough to compile the column header info into a more digestible format, as can be seen here:

![test_title2](Picture_of_Cars_Names.png)

The 2 columns that the **"imports-85.names"** file brings special attention to (besides the very important **price** column in No. 26) are No. 1 and 2: **symboling** and **normalized-losses**.

**Symboling:** 
**The degree to which the auto is more risky than its price indicates.**   
Each car is initially assigned a risk factor symbol associated with its price.
Then, if it is more risky (or less), this symbol is adjusted by moving it up (or down) the scale.  
Actuaries call this process "symboling". A value of +3 indicates that the auto is risky, -3 that it is probably pretty safe.

**Normalized-losses:**
**The relative average loss payment per insured vehicle year.**   
This value is normalized for all autos within a particular size classification (two-door small, station wagons, sports/speciality, etc...), and represents the average loss per car per year.

Now that we have a basic understanding of our data, we are ready to import our data.

# Data Import

We will import the data using the **Pandas Library**, which offers data structures and tools for effective data manipulation and analysis.  
We will do this in 3 steps:
1. Import the Pandas Library
2. Create a path variable to hold our data source location
3. Use the **read_csv method** to import our data into a DataFrame named 'df'

In [8]:
import pandas as pd
path="Desktop/imports-85.data"
df=pd.read_csv(path, header=None) #specify that there is no header in the dataset

After importing a dataset, it's good practice to print the dataset (or a sample of it) to make sure nothing went wrong.  
In most cases, using **dataframe.head(n)** to show the first n rows is a sufficient check.  
Let's check the first 5 rows:

In [9]:
df.head(5)

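|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | ? | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 13495 |
| 1 | 3 | ? | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 16500 |
| 2 | 1 | ? | alfa-romero | gas | std | two | hatchback | rwd | front | 94.5 | ... | 152 | mpfi | 2.68 | 3.47 | 9.0 | 154 | 5000 | 19 | 26 | 16500 |
| 3 | 2 | 164 | audi | gas | std | four | sedan | fwd | front | 99.8 | ... | 109 | mpfi | 3.19 | 3.4 | 10.0 | 102 | 5500 | 24 | 30 | 13950 |
| 4 | 2 | 164 | audi | gas | std | four | sedan | 4wd | front | 99.4 | ... | 136 | mpfi | 3.19 | 3.4 | 8.0 | 115 | 5500 | 18 | 22 | 17450 |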


The dataset was imported successfully. However, since the file has no header row, Pandas substituted default integer headers, which are neither descriptive nor easy to memorize.  
Using the header names we found in the helpful **"imports-85.names"** file, we can create a list that holds the headers.  
Then, we can replace the default integer column names using the list.

In [10]:
headers=["symboling","normalized-losses","make","fuel-type","aspiration","num-of-doors","body-style","drive-wheels",
         "engine-location","wheel-base","length","width","height","curb-weight","engine-type","num-of-cylinders","engine-size",
         "fuel-system","bore","stroke","compression-ratio","horsepower","peak-rpm","city-mpg","highway-mpg","price"]

In [11]:
df.columns=headers
df.head(5)

|   | symboling | normalized-losses | make | fuel-type | aspiration | num-of-doors | body-style | drive-wheels | engine-location | wheel-base | ... | engine-size | fuel-system | bore | stroke | compression-ratio | horsepower | peak-rpm | city-mpg | highway-mpg | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | ? | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 13495 |
| 1 | 3 | ? | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 16500 |
| 2 | 1 | ? | alfa-romero | gas | std | two | hatchback | rwd | front | 94.5 | ... | 152 | mpfi | 2.68 | 3.47 | 9.0 | 154 | 5000 | 19 | 26 | 16500 |
| 3 | 2 | 164 | audi | gas | std | four | sedan | fwd | front | 99.8 | ... | 109 | mpfi | 3.19 | 3.4 | 10.0 | 102 | 5500 | 24 | 30 | 13950 |
| 4 | 2 | 164 | audi | gas | std | four | sedan | 4wd | front | 99.4 | ... | 136 | mpfi | 3.19 | 3.4 | 8.0 | 115 | 5500 | 18 | 22 | 17450 |


We now have the proper headers for our dataset, and can move on to wrangling our data.
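
As an alternative sketch, the column names can also be supplied at import time through the `names` parameter of `read_csv`, which avoids the intermediate integer headers. The two-row `sample` string and the shortened `short_headers` list below are illustrative assumptions so that the snippet runs without the data file.

```python
import io
import pandas as pd

# Illustrative two-row, four-column excerpt; the real file has 26 columns.
sample = "3,?,alfa-romero,gas\n2,164,audi,gas\n"
short_headers = ["symboling", "normalized-losses", "make", "fuel-type"]

# Passing names= supplies the column labels; when names is given and
# header is not, pandas treats the first line as data rather than a header.
df_sample = pd.read_csv(io.StringIO(sample), names=short_headers)

print(df_sample.columns.tolist())
print(df_sample.shape)
```

With the real file, the same call would presumably take the `path` variable and the full 26-item `headers` list instead.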

# Data Wrangling

Before we can do actual analysis on our data, we have to make sure that our data is properly wrangled, or pre-processed.  
For example, the dataset could have incorrect data types, missing values, misformatted data, and so forth.  
Let's start with missing values.

### Dealing with Missing Values - 5 Ways

There are many ways to deal with missing values, which include:

1. Check if the data source author knows what the value should be
2. Replace the missing data with the average or most common value of the variable
3. Remove the whole observation with the missing value (this is usually best if not many observations are missing values)
4. Remove the whole variable with the missing value
5. Leave the missing data as is

Though every analysis is different and there are always exceptions, I've listed these methods in the general order of preference.
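
Options 2 through 4 in the list above can be sketched on a small, hypothetical DataFrame. The `toy` data and all variable names here are made up for illustration and are not taken from the car dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; np.nan marks the one missing price.
toy = pd.DataFrame({
    "horsepower": [111.0, 154.0, 102.0, 115.0],
    "price": [13495.0, np.nan, 13950.0, 17450.0],
})

# Count the missing values in each column before deciding what to do.
missing_counts = toy.isna().sum()

# Option 2: replace missing values with the column average.
filled = toy.fillna(toy.mean(numeric_only=True))

# Option 3: remove every observation (row) that has a missing value.
rows_dropped = toy.dropna(axis=0)

# Option 4: remove every variable (column) that has a missing value.
cols_dropped = toy.dropna(axis=1)

print(missing_counts)
print(filled)
print(rows_dropped.shape, list(cols_dropped.columns))
```

None of these calls modify `toy` itself; each returns a new DataFrame, which makes it easy to compare the options side by side.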

Let's take a look at how to drop observations with missing values.

### Dealing with Missing Values - Dropping Observations

We can use the **"dropna" method** to drop rows (as well as columns) that contain missing values.  
In this case, since our analysis is focusing on the column **"price"**, we will drop rows that are missing a price value.

In [15]:
df.dropna(subset=["price"], axis=0, inplace=True) #axis=0 affects rows, inplace=True modifies the original dataset
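
One caveat: the sample output earlier shows missing entries written as "?", and `dropna` only removes values that Pandas recognizes as missing (such as NaN). The sketch below, on a hypothetical three-row frame, shows the placeholder being converted before the rows are dropped. The `toy` frame and variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical excerpt in which one price is the "?" placeholder.
toy = pd.DataFrame({
    "make": ["audi", "audi", "bmw"],
    "price": ["13950", "?", "16430"],
})

# dropna alone removes nothing here, because "?" is an ordinary string.
unchanged = toy.dropna(subset=["price"], axis=0)

# Convert the placeholder to NaN first, then drop the rows without a price.
cleaned = toy.replace("?", np.nan).dropna(subset=["price"], axis=0)

print(len(unchanged), len(cleaned))
```

Another option is to pass `na_values="?"` to `read_csv`, so the placeholders are read as NaN from the start.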

### Dealing with Missing Values - Replacing Missing Values

We can use the **replace method** to substitute one value for another. If we wanted to replace missing values in our **"curb-weight"** column with the average, we first calculate the average. Once we have the average stored in a variable, we can replace every missing value with it.

In [17]:
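import numpy as np
mean=df["curb-weight"].mean() #average of the non-missing values
df["curb-weight"].replace(np.nan, mean) #returns a new Series; assign it back to df["curb-weight"] to keep the change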

0      2548
1      2548
2      2823
3      2337
4      2824
5      2507
6      2844
7      2954
8      3086
9      3053
10     2395
11     2395
12     2710
13     2765
14     3055
15     3230
16     3380
17     3505
18     1488
19     1874
20     1909
21     1876
22     1876
23     2128
24     1967
25     1989
26     1989
27     2191
28     2535
29     2811
       ... 
175    2414
176    2414
177    2458
178    2976
179    3016
180    3131
181    3151
182    2261
183    2209
184    2264
185    2212
186    2275
187    2319
188    2300
189    2254
190    2221
191    2661
192    2579
193    2563
194    2912
195    3034
196    2935
197    3042
198    3045
199    3157
200    2952
201    3049
202    3012
203    3217
204    3062
Name: curb-weight, Length: 205, dtype: int64
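
Note that the output above is an int64 column of length 205, so this particular column contains no NaN values and the call leaves it unchanged. The sketch below uses a hypothetical Series that does have a gap to show what the replacement does. It also shows `fillna`, which gives the same result here. The `weights` values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical weights with one missing entry.
weights = pd.Series([2548.0, np.nan, 2823.0], name="curb-weight")

mean = weights.mean()                     # NaN is skipped when averaging
replaced = weights.replace(np.nan, mean)  # returns a new Series
same_result = weights.fillna(mean)        # fillna fills the same gap

# The original is untouched until the result is assigned back.
still_missing = weights.isna().sum()
weights = replaced

print(mean, still_missing, weights.isna().sum())
```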

Pandas automatically assigns data types based on the encoding it detects from the original data table.  
For a number of reasons, this assignment may not always be correct and could cause problems.  
For example, if you wanted to apply a math function (like sum or divide) to a column like "weight" but it was accidentally assigned as an object or character column, the operation would not work as intended.  
We can use the **dtypes** attribute to return the data type of each column and see if something is not as we expected.

In [14]:
df.dtypes

symboling              int64
normalized-losses     object
make                  object
fuel-type             object
aspiration            object
num-of-doors          object
body-style            object
drive-wheels          object
engine-location       object
wheel-base           float64
length               float64
width                float64
height               float64
curb-weight            int64
engine-type           object
num-of-cylinders      object
engine-size            int64
fuel-system           object
bore                  object
stroke                object
compression-ratio    float64
horsepower            object
peak-rpm              object
city-mpg               int64
highway-mpg            int64
price                 object
dtype: object
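
Several columns above, such as "bore" and "price", hold numbers but are reported as object. A likely cause is the "?" placeholder seen in the sample output, since a single text value forces the whole column to be read as text. The sketch below shows one way to convert such a column with `pd.to_numeric`. The three-value `toy` frame is a hypothetical example, not the notebook's data.

```python
import pandas as pd

# Hypothetical excerpt: numbers stored as text, with "?" marking a missing value.
toy = pd.DataFrame({"bore": ["3.47", "2.68", "?"]})
dtype_before = toy["bore"].dtype

# errors="coerce" turns anything that cannot be parsed (here "?") into NaN,
# so the column becomes a float column that supports arithmetic.
toy["bore"] = pd.to_numeric(toy["bore"], errors="coerce")
dtype_after = toy["bore"].dtype

bore_mean = toy["bore"].mean()  # NaN is skipped when averaging
print(dtype_before, dtype_after, bore_mean)
```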

As we read down the output, most of the data types make sense; however, some do not. Some examples include: **"bore"** (which is an engine dimension yet is assigned as an **object**), and BLANK