<h2>Table of Contents</h2>

<div class="alert alert-block alert-info" style="margin-top: 20px">
<ol>
    <li><a href="#data_acquisition">Data Acquisition</a></li>
    <li><a href="#basic_insight">Basic Insight of Dataset</a></li>
</ol>

</div>
<hr>

<h1 id="data_acquisition">Data Acquisition</h1>
<p>
Datasets come in various formats, such as .csv, .json, and .xlsx, and can be stored in different places: on your local machine or online.<br>
In this section, you will load a dataset into a Jupyter Notebook.<br>
In our case, the Automobile Dataset is a file called auto.csv, in CSV (comma-separated values) format. Let's use this dataset as an example to practice reading data.

The Pandas library is a useful tool that enables us to read various datasets into a data frame. Our Jupyter notebook platform has the <b>Pandas library</b> built in, so all we need to do is import Pandas; no installation is required.
</p>

<h2>Read Data</h2>


In [1]:
# Import pandas library
import pandas as pd

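# Read the file auto.csv and assign it to the variable "df".
# Note: auto.csv has no header row, so with the default settings
# read_csv uses the first record as the column names.
df = pd.read_csv('auto.csv')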


In [2]:
# show the first 5 rows
print("The first 5 rows of the dataframe:") 
print(df.head(5))


The first 5 rows of the dataframe:
   3    ?  alfa-romero  gas  std   two  convertible  rwd  front  88.6  ...  \
0  3    ?  alfa-romero  gas  std   two  convertible  rwd  front  88.6  ...   
1  1    ?  alfa-romero  gas  std   two    hatchback  rwd  front  94.5  ...   
2  2  164         audi  gas  std  four        sedan  fwd  front  99.8  ...   
3  2  164         audi  gas  std  four        sedan  4wd  front  99.4  ...   
4  2    ?         audi  gas  std   two        sedan  fwd  front  99.8  ...   

   130  mpfi  3.47  2.68   9.0  111  5000  21  27  13495  
0  130  mpfi  3.47  2.68   9.0  111  5000  21  27  16500  
1  152  mpfi  2.68  3.47   9.0  154  5000  19  26  16500  
2  109  mpfi  3.19  3.40  10.0  102  5500  24  30  13950  
3  136  mpfi  3.19  3.40   8.0  115  5500  18  22  17450  
4  136  mpfi  3.19  3.40   8.5  110  5500  19  25  15250  

[5 rows x 26 columns]
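
Note that the column labels in the output above are really the first record of the file: auto.csv has no header row, and by default <code>read_csv</code> treats the first line as column names. The sketch below illustrates the difference that <code>header=None</code> makes. It uses a small in-memory sample rather than auto.csv itself; the four-field records are shortened, illustrative stand-ins for the real 26-field rows.

```python
import io

import pandas as pd

# Illustrative, shortened records in the same header-less style as auto.csv
# (the real file has 26 fields per line).
sample_text = "3,?,alfa-romero,13495\n1,?,alfa-romero,16500\n"

# Default: the first line is consumed as the column names.
with_default = pd.read_csv(io.StringIO(sample_text))

# header=None: every line is kept as data and columns are labeled 0, 1, 2, ...
with_header_none = pd.read_csv(io.StringIO(sample_text), header=None)

print(with_default.shape)
print(with_header_none.shape)
print(list(with_header_none.columns))
```

With the default call, one record is lost to the header; with <code>header=None</code>, both records are kept as data.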


In [3]:
# show the bottom 10 rows
print('The bottom 10 rows: ')
print(df.tail(10))


The bottom 10 rows: 
     3    ? alfa-romero     gas    std   two convertible  rwd  front   88.6  \
194 -1   74       volvo     gas    std  four       wagon  rwd  front  104.3   
195 -2  103       volvo     gas    std  four       sedan  rwd  front  104.3   
196 -1   74       volvo     gas    std  four       wagon  rwd  front  104.3   
197 -2  103       volvo     gas  turbo  four       sedan  rwd  front  104.3   
198 -1   74       volvo     gas  turbo  four       wagon  rwd  front  104.3   
199 -1   95       volvo     gas    std  four       sedan  rwd  front  109.1   
200 -1   95       volvo     gas  turbo  four       sedan  rwd  front  109.1   
201 -1   95       volvo     gas    std  four       sedan  rwd  front  109.1   
202 -1   95       volvo  diesel  turbo  four       sedan  rwd  front  109.1   
203 -1   95       volvo     gas  turbo  four       sedan  rwd  front  109.1   

     ...  130  mpfi  3.47  2.68   9.0  111  5000  21  27  13495  
194  ...  141  mpfi  3.78  3.15   9.5  114 

<h3>Add Headers</h3>
<p>
Take a look at our dataset: auto.csv has no header row, so pandas used the first record of the file as the column names. (Had the file been read with <code>header=None</code>, pandas would have labeled the columns with integers starting from 0.)
</p>
<p>
To better describe our data, we can introduce a header. The column names are available at:  <a href="https://archive.ics.uci.edu/ml/datasets/Automobile" target="_blank">https://archive.ics.uci.edu/ml/datasets/Automobile</a>
</p>
<p>
Thus, we have to add headers manually.
</p>
<p>
First, we create a list "headers" that includes all column names in order.
</p>

In [4]:
# create headers list

headers = ['symboling', 'normalized-losses', 'make', 'fuel-type', 'aspiration', 'num-of-doors', 'body-styles', 'drive-wheels', 'engine-location','wheel-base', 'length', 'width', 'height', 'curb-weight', 'engine-type', 'num-of-cylinders', 'engine-size', 'fuel-system', 'bore', 'stroke', 'compression-ratio', 'horsepower', 'peak-rpm', 'city-mpg', 'highway-mpg', 'price']
df.columns = headers
print("headers\n", headers)

headers
 ['symboling', 'normalized-losses', 'make', 'fuel-type', 'aspiration', 'num-of-doors', 'body-styles', 'drive-wheels', 'engine-location', 'wheel-base', 'length', 'width', 'height', 'curb-weight', 'engine-type', 'num-of-cylinders', 'engine-size', 'fuel-system', 'bore', 'stroke', 'compression-ratio', 'horsepower', 'peak-rpm', 'city-mpg', 'highway-mpg', 'price']
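
As an alternative to assigning <code>df.columns</code> after loading, the names can be supplied while reading, through the <code>names</code> parameter of <code>read_csv</code>. The following is a sketch on a three-column, in-memory sample; <code>short_headers</code> and the sample text are illustrative assumptions, and with the real file you would pass the full 26-name list.

```python
import io

import pandas as pd

# Hypothetical, shortened header list and sample records for illustration.
short_headers = ["symboling", "normalized-losses", "make"]
sample_text = "3,?,alfa-romero\n2,164,audi\n"

# header=None keeps the first line as data; names= labels the columns.
labeled = pd.read_csv(io.StringIO(sample_text), header=None, names=short_headers)

print(labeled)
```

Reading the file this way keeps every record as data, because no line is interpreted as a header.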


We replace the headers and recheck our data frame by showing the first 10 rows.

In [5]:
# show the first 10 rows 
print(df.head(10))
print(df.columns)

   symboling normalized-losses         make fuel-type aspiration num-of-doors  \
0          3                 ?  alfa-romero       gas        std          two   
1          1                 ?  alfa-romero       gas        std          two   
2          2               164         audi       gas        std         four   
3          2               164         audi       gas        std         four   
4          2                 ?         audi       gas        std          two   
5          1               158         audi       gas        std         four   
6          1                 ?         audi       gas        std         four   
7          1               158         audi       gas      turbo         four   
8          0                 ?         audi       gas      turbo          two   
9          2               192          bmw       gas        std          two   

   body-styles drive-wheels engine-location  wheel-base  ...  engine-size  \
0  convertible          rwd    

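The dataset marks missing values with "?". We can drop the rows that have a missing value in the column "price" as follows:

In [6]:
# drop only the rows whose "price" is "?"; this returns a new data frame and leaves "df" unchanged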

dropmissingvalues = df.drop(df.index[df['price'] == '?']) 
print(dropmissingvalues)

     symboling normalized-losses         make fuel-type aspiration  \
0            3                 ?  alfa-romero       gas        std   
1            1                 ?  alfa-romero       gas        std   
2            2               164         audi       gas        std   
3            2               164         audi       gas        std   
4            2                 ?         audi       gas        std   
5            1               158         audi       gas        std   
6            1                 ?         audi       gas        std   
7            1               158         audi       gas      turbo   
9            2               192          bmw       gas        std   
10           0               192          bmw       gas        std   
11           0               188          bmw       gas        std   
12           0               188          bmw       gas        std   
13           1                 ?          bmw       gas        std   
14           0      
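
Another common approach is to convert the "?" markers to <code>NaN</code> first, so that pandas' own missing-value tools apply. A minimal sketch on a toy frame follows; the values are illustrative and not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy data: "?" stands for a missing price, as in the automobile dataset.
toy = pd.DataFrame(
    {"make": ["audi", "audi", "bmw"], "price": ["17450", "?", "16430"]}
)

# Turn the "?" placeholder into a real missing value ...
with_nan = toy.replace("?", np.nan)

# ... then drop only the rows where "price" is missing.
price_known = with_nan.dropna(subset=["price"])

print(price_known)
```

As with <code>drop</code>, both calls return new data frames, so <code>toy</code> itself is left unchanged.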

Now we have successfully read the raw dataset and added the correct headers to the data frame.

In [7]:
# display the column names of the data frame
df.columns



Index(['symboling', 'normalized-losses', 'make', 'fuel-type', 'aspiration',
       'num-of-doors', 'body-styles', 'drive-wheels', 'engine-location',
       'wheel-base', 'length', 'width', 'height', 'curb-weight', 'engine-type',
       'num-of-cylinders', 'engine-size', 'fuel-system', 'bore', 'stroke',
       'compression-ratio', 'horsepower', 'peak-rpm', 'city-mpg',
       'highway-mpg', 'price'],
      dtype='object')

<h2>Save Dataset</h2>
<p>
Correspondingly, Pandas enables us to save a dataset to CSV with the <code>to_csv()</code> method. Pass the file path and name, in quotation marks, inside the parentheses.
</p>
<p>
    Save the dataframe <b>df</b> as <b>automobile.csv</b> to your local machine:
</p>

In [8]:
# save "df" as automobile.csv in the current working directory;
# index=False keeps the row index from being written as an extra, unnamed column
df.to_csv('automobile.csv', index=False)
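
By default, <code>to_csv</code> also writes the row index as a first, unnamed column, which comes back as <code>Unnamed: 0</code> when the file is read again. The sketch below compares both settings. It writes to in-memory strings instead of files, and the toy frame is illustrative.

```python
import io

import pandas as pd

toy = pd.DataFrame({"make": ["audi", "bmw"], "city-mpg": [24, 23]})

# Default: the index is written as an extra leading column.
with_index = toy.to_csv()
# index=False: only the data columns are written.
without_index = toy.to_csv(index=False)

reread_with_index = pd.read_csv(io.StringIO(with_index))
reread_without_index = pd.read_csv(io.StringIO(without_index))

print(list(reread_with_index.columns))
print(list(reread_without_index.columns))
```

Passing <code>index=False</code> is usually the better choice when the index is just the default row counter.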

We can also read and save other file formats using similar functions. Please check the slides for more information.
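
For instance, JSON has an analogous pair of functions, <code>to_json</code> and <code>read_json</code>. Here is a brief sketch with an illustrative toy frame, round-tripping through an in-memory string rather than a file:

```python
import io

import pandas as pd

toy = pd.DataFrame({"make": ["audi", "bmw"], "city-mpg": [24, 23]})

# Serialize to a JSON string, one object per row.
json_text = toy.to_json(orient="records")

# Read the JSON back into a data frame, using the same orientation.
restored = pd.read_json(io.StringIO(json_text), orient="records")

print(restored)
```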


<h1 id="basic_insight">Basic Insight of Dataset</h1>
<p>
After reading data into a Pandas dataframe, it is time to explore the dataset.<br>
There are several ways to obtain essential insights into the data and better understand it.
</p>

<h2>Data Types</h2>
<p>
Data has a variety of types.<br>
The main types stored in Pandas dataframes are <b>object</b>, <b>float</b>, <b>int</b>, <b>bool</b> and <b>datetime64</b>. To learn more about each attribute, it is always good to know the data type of each column. In Pandas:
</p>

In [9]:
#show the types of all the columns 
df.dtypes

symboling              int64
normalized-losses     object
make                  object
fuel-type             object
aspiration            object
num-of-doors          object
body-styles           object
drive-wheels          object
engine-location       object
wheel-base           float64
length               float64
width                float64
height               float64
curb-weight            int64
engine-type           object
num-of-cylinders      object
engine-size            int64
fuel-system           object
bore                  object
stroke                object
compression-ratio    float64
horsepower            object
peak-rpm              object
city-mpg               int64
highway-mpg            int64
price                 object
dtype: object

<p>
As shown above, "symboling" and "curb-weight" are <code>int64</code>, "normalized-losses" is <code>object</code>, "wheel-base" is <code>float64</code>, and so on. Note that columns holding numbers, such as "normalized-losses" and "price", are typed <code>object</code> because they also contain the "?" placeholder.
</p>
<p>
These data types can be changed; we will learn how to accomplish this in a later module.
</p>
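
As a preview, here is one way this could look for a column such as "price", where "?" marks a missing value: <code>pd.to_numeric</code> with <code>errors="coerce"</code> turns unparseable entries into <code>NaN</code>. This is a sketch on illustrative toy values, not necessarily the approach taken in the later module.

```python
import pandas as pd

# Toy column: numbers stored as text, with "?" marking a missing value.
toy = pd.DataFrame({"price": ["13495", "?", "16500"]})

# Strings that cannot be parsed as numbers (here "?") become NaN.
toy["price"] = pd.to_numeric(toy["price"], errors="coerce")

print(toy["price"].dtype)
print(toy["price"].mean())
```

Once the column is numeric, statistics such as the mean skip the missing entry automatically.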

<h2>Describe</h2>
We would like to get a statistical summary of each column, such as the count, mean, standard deviation, etc.

The <code>describe()</code> method provides various summary statistics, excluding <code>NaN</code> (Not a Number) values.

In [10]:
df.describe()

       symboling  wheel-base     length      width    height  \
count      204.0       204.0      204.0      204.0     204.0
mean    0.823529   98.806373    174.075  65.916667  53.74902
std     1.239035    5.994144  12.362123   2.146716  2.424901
min         -2.0        86.6      141.1       60.3      47.8
25%          0.0        94.5      166.3     64.075      52.0
50%          1.0        97.0      173.2       65.5      54.1
75%          2.0       102.4      183.2       66.9      55.5
max          3.0       120.9      208.1       72.3      59.8

       curb-weight  engine-size  compression-ratio   city-mpg  highway-mpg
count        204.0        204.0              204.0      204.0        204.0
mean   2555.602941   126.892157          10.148137  25.240196    30.769608
std      521.96082    41.744569              3.981   6.551513     6.898337
min         1488.0         61.0                7.0       13.0         16.0
25%         2145.0         97.0              8.575       19.0         25.0
50%         2414.0        119.5                9.0       24.0         30.0
75%        2939.25        142.0                9.4       30.0         34.5
max         4066.0        326.0               23.0       49.0         54.0


<p>
This shows the statistical summary of all numeric-typed (int, float) columns.<br>
For example, the attribute "symboling" has a count of 204, a mean of 0.82, a standard deviation of 1.24, a minimum of -2, a 25th percentile of 0, a 50th percentile of 1, a 75th percentile of 2, and a maximum of 3.
<br>
However, what if we would also like to check all the columns, including those of type object?
<br><br>

You can add the argument <code>include="all"</code> inside the parentheses. Let's try it again.
    
</p>

In [11]:
# describe all the columns in "df" 
df.describe(include='all')

        symboling  normalized-losses    make  fuel-type  aspiration  \
count       204.0                204     204        204         204
unique        NaN                 52      22          2           2
top           NaN                  ?  toyota        gas         std
freq          NaN                 40      32        184         167
mean     0.823529                NaN     NaN        NaN         NaN
std      1.239035                NaN     NaN        NaN         NaN
min          -2.0                NaN     NaN        NaN         NaN
25%           0.0                NaN     NaN        NaN         NaN
50%           1.0                NaN     NaN        NaN         NaN
75%           2.0                NaN     NaN        NaN         NaN

        num-of-doors  body-styles  drive-wheels  engine-location  wheel-base  ...  \
count            204          204           204              204       204.0  ...
unique             3            5             3                2         NaN  ...
top             four        sedan           fwd            front         NaN  ...
freq             114           96           120              201         NaN  ...
mean             NaN          NaN           NaN              NaN   98.806373  ...
std              NaN          NaN           NaN              NaN    5.994144  ...
min              NaN          NaN           NaN              NaN        86.6  ...
25%              NaN          NaN           NaN              NaN        94.5  ...
50%              NaN          NaN           NaN              NaN        97.0  ...
75%              NaN          NaN           NaN              NaN       102.4  ...

        engine-size  fuel-system   bore  stroke  compression-ratio  \
count         204.0          204  204.0   204.0              204.0
unique          NaN            8   39.0    37.0                NaN
top             NaN         mpfi   3.62     3.4                NaN
freq            NaN           93   23.0    20.0                NaN
mean     126.892157          NaN    NaN     NaN          10.148137
std       41.744569          NaN    NaN     NaN              3.981
min            61.0          NaN    NaN     NaN                7.0
25%            97.0          NaN    NaN     NaN              8.575
50%           119.5          NaN    NaN     NaN                9.0
75%           142.0          NaN    NaN     NaN                9.4

        horsepower  peak-rpm   city-mpg  highway-mpg  price
count        204.0     204.0      204.0        204.0    204
unique        60.0      24.0        NaN          NaN    186
top           68.0    5500.0        NaN          NaN      ?
freq          19.0      37.0        NaN          NaN      4
mean           NaN       NaN  25.240196    30.769608    NaN
std            NaN       NaN   6.551513     6.898337    NaN
min            NaN       NaN       13.0         16.0    NaN
25%            NaN       NaN       19.0         25.0    NaN
50%            NaN       NaN       24.0         30.0    NaN
75%            NaN       NaN       30.0         34.5    NaN


<p>
Now, it provides the statistical summary of all the columns, including object-typed attributes.<br>
For the object-typed columns, we can now see how many unique values there are, which value is the most frequent (top), and how often it occurs (freq).<br>
Some values in the table above show as "NaN" because that statistic is not defined for the column's type: for example, the mean of an object-typed column, or the top value of a numeric column.<br>
</p>
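
The pattern of <code>NaN</code> entries is easy to reproduce on a small frame. In the sketch below (toy values, chosen for illustration), the text column gets <code>unique</code>, <code>top</code>, and <code>freq</code> but no <code>mean</code>, while the numeric column gets the opposite.

```python
import pandas as pd

# Toy frame with one text column and one numeric column.
toy = pd.DataFrame(
    {"make": ["audi", "audi", "bmw"], "city-mpg": [24, 19, 23]}
)

summary = toy.describe(include="all")
print(summary)

# Statistics that do not apply to a column's type are reported as NaN.
print(pd.isna(summary.loc["mean", "make"]))
print(pd.isna(summary.loc["unique", "city-mpg"]))
```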

In [14]:
# describe only the columns "compression-ratio" and "length"
df[['compression-ratio', 'length']].describe()



       compression-ratio     length
count              204.0      204.0
mean           10.148137    174.075
std                3.981  12.362123
min                  7.0      141.1
25%                8.575      166.3
50%                  9.0      173.2
75%                  9.4      183.2
max                 23.0      208.1
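
Each row of such a summary can also be computed directly, which helps when only one statistic is needed. Here is a sketch on an illustrative toy column. Note that <code>std()</code>, like <code>describe()</code>, returns the sample standard deviation (divisor n - 1).

```python
import pandas as pd

# Toy numeric column for illustration.
values = pd.Series([1.0, 2.0, 3.0, 4.0], name="toy")

# The full summary ...
summary = values.describe()

# ... and the same statistics computed one at a time.
print(values.count(), values.mean(), values.std())
print(values.quantile([0.25, 0.5, 0.75]).tolist())
```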


<h2>Info</h2>
Another method you can use to check your dataset is <code>info()</code>.

It provides a concise summary of your DataFrame.

In [13]:
# look at the info of "df" (note the parentheses: info is a method and must be called)
df.info()

<p>
This prints the index range, the name of each column with its count of non-null values and its data type, and the memory usage of the data frame.
<br><br>
It also shows the size of the whole data frame, which here is 204 entries (rows) and 26 columns.
</p>
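
Because <code>info</code> is a method, it has to be called with parentheses; the bare attribute <code>df.info</code> only displays the method object together with the frame's contents. The sketch below uses an illustrative toy frame, captures the summary in a text buffer through the <code>buf</code> parameter, and reads the size from <code>shape</code>.

```python
import io

import pandas as pd

toy = pd.DataFrame(
    {"make": ["audi", "audi", "bmw"], "city-mpg": [24, 19, 23]}
)

# info() prints its summary; passing buf= captures the text instead.
buffer = io.StringIO()
toy.info(buf=buffer)
info_text = buffer.getvalue()
print(info_text)

# shape gives (number of rows, number of columns).
print(toy.shape)
```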

<h1>Excellent! You have just completed the Introduction Notebook!</h1>

<h3>About the Authors:</h3>

This notebook was written by <a href="https://www.linkedin.com/in/mahdi-noorian-58219234/" target="_blank">Mahdi Noorian PhD</a>, <a href="https://www.linkedin.com/in/joseph-s-50398b136/" target="_blank">Joseph Santarcangelo</a>, Bahare Talayian, Eric Xiao, Steven Dong, Parizad, Hima Vsudevan, <a href="https://www.linkedin.com/in/fiorellawever/" target="_blank">Fiorella Wever</a>, and <a href="https://www.linkedin.com/in/yi-leng-yao-84451275/" target="_blank">Yi Yao</a>.

<p><a href="https://www.linkedin.com/in/joseph-s-50398b136/" target="_blank">Joseph Santarcangelo</a> is a Data Scientist at IBM, and holds a PhD in Electrical Engineering. His research focused on using Machine Learning, Signal Processing, and Computer Vision to determine how videos impact human cognition. Joseph has been working for IBM since he completed his PhD.</p>

<hr>
<p>Copyright &copy; 2018 IBM Developer Skills Network. This notebook and its source code are released under the terms of the <a href="https://cognitiveclass.ai/mit-license/">MIT License</a>.</p>