<a href="https://colab.research.google.com/github/sankarramamurthy/Data-Analytics/blob/main/Data_Analytics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Useful Libraries

**Python Scientific Computing Libraries**
*   pandas - data structures and tools; its core 2D table is the DataFrame, with easy indexing
*   NumPy - arrays and matrices
*   SciPy - integrals, solving differential equations, optimization

**Data Visualization Libraries**
*   Matplotlib - plots & graphs, most popular
*   Seaborn - heat maps, time series, violin plots

**Algorithmic Libraries**
*   Scikit-learn - machine learning: regression, classification, etc.; built on NumPy, SciPy & Matplotlib
*   statsmodels - explore data, estimate statistical models, perform statistical tests



# Reading & Writing Data

**Data formats for reading from and writing to files**

| Data Format | Read | Save |
| -------- | -------- | -------- |
| CSV | `pd.read_csv()` | `df.to_csv()` |
| JSON | `pd.read_json()` | `df.to_json()` |
| Excel | `pd.read_excel()` | `df.to_excel()` |
| SQL | `pd.read_sql()` | `df.to_sql()` |

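The read/save pairs in the table above can be tried without touching the disk. The following is a minimal sketch, assuming a small toy DataFrame (the `cars` frame and its values are illustrative, not part of the dataset) and an in-memory SQLite database for the SQL pair; `pd.read_sql()` and `df.to_sql()` always need a database connection, unlike the file-based readers.

```python
import io
import sqlite3

import pandas as pd

# Toy frame; the values are illustrative only, not taken from the dataset.
cars = pd.DataFrame({"make": ["audi", "volvo"], "price": [13950, 16845]})

# CSV round trip through an in-memory text buffer instead of a file on disk.
csv_text = cars.to_csv(index=False)  # index=False omits the row labels
cars_csv = pd.read_csv(io.StringIO(csv_text))

# JSON round trip, one record per row.
json_text = cars.to_json(orient="records")
cars_json = pd.read_json(io.StringIO(json_text), orient="records")

# SQL round trip: both calls need a connection (here an in-memory SQLite database).
conn = sqlite3.connect(":memory:")
cars.to_sql("cars", conn, index=False)
cars_sql = pd.read_sql("SELECT * FROM cars", conn)
conn.close()

print(cars_csv)
print(cars_json)
print(cars_sql)
```

The Excel pair follows the same pattern but relies on an additional engine package being installed, so it is left out of this sketch.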

# Importing the Dataset & Checking Basic Stats

In [1]:
import pandas as pd
# Read the file from the URL below and assign the result to the variable "df"
other_path = "https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DA0101EN/auto.csv"
# header=None: the file has no header row, so pandas numbers the columns 0..25
df = pd.read_csv(other_path, header=None)
print(df.head(5)) # The first 5 rows of the DataFrame
print(df.tail(5)) # The last 5 rows of the DataFrame


   0    1            2    3    4     5            6    7      8     9   ...  \
0   3    ?  alfa-romero  gas  std   two  convertible  rwd  front  88.6  ...   
1   3    ?  alfa-romero  gas  std   two  convertible  rwd  front  88.6  ...   
2   1    ?  alfa-romero  gas  std   two    hatchback  rwd  front  94.5  ...   
3   2  164         audi  gas  std  four        sedan  fwd  front  99.8  ...   
4   2  164         audi  gas  std  four        sedan  4wd  front  99.4  ...   

    16    17    18    19    20   21    22  23  24     25  
0  130  mpfi  3.47  2.68   9.0  111  5000  21  27  13495  
1  130  mpfi  3.47  2.68   9.0  111  5000  21  27  16500  
2  152  mpfi  2.68  3.47   9.0  154  5000  19  26  16500  
3  109  mpfi  3.19  3.40  10.0  102  5500  24  30  13950  
4  136  mpfi  3.19  3.40   8.0  115  5500  18  22  17450  

[5 rows x 26 columns]
     0   1      2       3      4     5      6    7      8      9   ...   16  \
200  -1  95  volvo     gas    std  four  sedan  rwd  front  109.1  ..

In [2]:
# create headers list
headers = ["symboling","normalized-losses","make","fuel-type","aspiration", "num-of-doors","body-style",
         "drive-wheels","engine-location","wheel-base", "length","width","height","curb-weight","engine-type",
         "num-of-cylinders", "engine-size","fuel-system","bore","stroke","compression-ratio","horsepower",
         "peak-rpm","city-mpg","highway-mpg","price"]

df.columns = headers
print(df.head(5))

   symboling normalized-losses         make fuel-type aspiration num-of-doors  \
0          3                 ?  alfa-romero       gas        std          two   
1          3                 ?  alfa-romero       gas        std          two   
2          1                 ?  alfa-romero       gas        std          two   
3          2               164         audi       gas        std         four   
4          2               164         audi       gas        std         four   

    body-style drive-wheels engine-location  wheel-base  ...  engine-size  \
0  convertible          rwd           front        88.6  ...          130   
1  convertible          rwd           front        88.6  ...          130   
2    hatchback          rwd           front        94.5  ...          152   
3        sedan          fwd           front        99.8  ...          109   
4        sedan          4wd           front        99.4  ...          136   

   fuel-system  bore  stroke compression-ratio hor

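Reading with `header=None` and then assigning `df.columns`, as in the two cells above, can also be collapsed into one call by passing `names=` to `read_csv`. The sketch below is an assumption-laden illustration rather than the notebook's own method: it uses three rows copied from the output above, trimmed to four fields, and a hypothetical `short_headers` list so that it runs without network access.

```python
import io

import pandas as pd

# Three rows from the output above, trimmed to four fields for brevity.
raw = "3,?,alfa-romero,13495\n1,?,alfa-romero,16500\n2,164,audi,13950\n"
short_headers = ["symboling", "normalized-losses", "make", "price"]  # hypothetical subset

# Option 1: read without a header row, then assign the names afterwards.
df_a = pd.read_csv(io.StringIO(raw), header=None)
df_a.columns = short_headers

# Option 2: supply the names while reading; with names= given and no header
# argument, pandas treats the first line as data, exactly like header=None.
df_b = pd.read_csv(io.StringIO(raw), names=short_headers)

print(df_b.columns.tolist())
print(df_b)
```

Both options yield the same table; the second simply avoids a window in which the columns carry integer labels.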
In [6]:
# Drop rows whose "price" is NaN using the dropna method.
# axis=0 drops rows; axis=1 would drop columns that contain missing data.
# inplace=True writes the result back to df instead of returning a new DataFrame.
# Note: this file marks missing values with "?", which dropna does not treat as NaN.
df.dropna(subset=["price"], axis=0, inplace=True)
# Display the columns of df
print(df.columns)

# Save the dataset to CSV (index=False omits the row labels)
df.to_csv("automobile.csv", index=False)

# Show each column name along with its data type
df.dtypes

Index(['symboling', 'normalized-losses', 'make', 'fuel-type', 'aspiration',
       'num-of-doors', 'body-style', 'drive-wheels', 'engine-location',
       'wheel-base', 'length', 'width', 'height', 'curb-weight', 'engine-type',
       'num-of-cylinders', 'engine-size', 'fuel-system', 'bore', 'stroke',
       'compression-ratio', 'horsepower', 'peak-rpm', 'city-mpg',
       'highway-mpg', 'price'],
      dtype='object')


| Column | dtype |
| -------- | -------- |
| symboling | int64 |
| normalized-losses | object |
| make | object |
| fuel-type | object |
| aspiration | object |
| num-of-doors | object |
| body-style | object |
| drive-wheels | object |
| engine-location | object |
| wheel-base | float64 |

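The file marks missing values with `?`, which pandas reads as an ordinary string, so `dropna` has nothing to remove until the placeholder is converted to NaN. The sketch below shows one way to handle this, assuming `na_values="?"` is passed to `read_csv`; it uses two rows from the output above plus one hypothetical row whose price is `?`, and a trimmed four-column layout chosen only for illustration.

```python
import io

import pandas as pd

# Two rows from the output above plus one hypothetical row whose price is "?".
raw = "3,?,alfa-romero,13495\n2,164,audi,13950\n0,?,example-make,?\n"
cols = ["symboling", "normalized-losses", "make", "price"]

# Read as in the notebook: "?" stays a string, so dropna() drops nothing.
as_text = pd.read_csv(io.StringIO(raw), names=cols)
rows_kept_without_conversion = len(as_text.dropna(subset=["price"]))

# Treat "?" as missing while reading; price then parses as a numeric column.
as_nan = pd.read_csv(io.StringIO(raw), names=cols, na_values="?")
cleaned = as_nan.dropna(subset=["price"])

print(rows_kept_without_conversion, len(cleaned))
print(cleaned["price"].dtype)
```

An alternative with the same effect on an already loaded frame is `df.replace("?", np.nan)` followed by `dropna` and an explicit type conversion.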

In [7]:
# Summary statistics for the numeric columns; NaN (Not a Number) values are excluded.
df.describe()

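| Statistic | symboling | wheel-base | length | width | height | curb-weight | engine-size | compression-ratio | city-mpg | highway-mpg |
| -------- | -------- | -------- | -------- | -------- | -------- | -------- | -------- | -------- | -------- | -------- |
| count | 205.0 | 205.0 | 205.0 | 205.0 | 205.0 | 205.0 | 205.0 | 205.0 | 205.0 | 205.0 |
| mean | 0.834146 | 98.756585 | 174.049268 | 65.907805 | 53.724878 | 2555.565854 | 126.907317 | 10.142537 | 25.219512 | 30.75122 |
| std | 1.245307 | 6.021776 | 12.337289 | 2.145204 | 2.443522 | 520.680204 | 41.642693 | 3.97204 | 6.542142 | 6.886443 |
| min | -2.0 | 86.6 | 141.1 | 60.3 | 47.8 | 1488.0 | 61.0 | 7.0 | 13.0 | 16.0 |
| 25% | 0.0 | 94.5 | 166.3 | 64.1 | 52.0 | 2145.0 | 97.0 | 8.6 | 19.0 | 25.0 |
| 50% | 1.0 | 97.0 | 173.2 | 65.5 | 54.1 | 2414.0 | 120.0 | 9.0 | 24.0 | 30.0 |
| 75% | 2.0 | 102.4 | 183.1 | 66.9 | 55.5 | 2935.0 | 141.0 | 9.4 | 30.0 | 34.0 |
| max | 3.0 | 120.9 | 208.1 | 72.3 | 59.8 | 4066.0 | 326.0 | 23.0 | 49.0 | 54.0 |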

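By default `describe()` summarizes only numeric columns, which is why text columns such as `make` are absent from the summary above. Passing `include="all"` adds them, reporting `unique`, `top`, and `freq` in place of the numeric statistics. The sketch below uses the `make`, `body-style`, and `engine-size` values from the first five rows shown earlier; the frame name `sample` is hypothetical.

```python
import pandas as pd

# Values taken from the first five rows printed earlier in the notebook.
sample = pd.DataFrame({
    "make": ["alfa-romero", "alfa-romero", "alfa-romero", "audi", "audi"],
    "body-style": ["convertible", "convertible", "hatchback", "sedan", "sedan"],
    "engine-size": [130, 130, 152, 109, 136],
})

# Default behaviour: only the numeric column is summarized.
numeric_summary = sample.describe()

# include="all" also covers text columns (count, unique, top, freq);
# statistics that do not apply to a column are shown as NaN.
full_summary = sample.describe(include="all")

print(numeric_summary)
print(full_summary)
```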

In [8]:
#describe for selected columns
df[['length', 'compression-ratio']].describe()

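| Statistic | length | compression-ratio |
| -------- | -------- | -------- |
| count | 205.0 | 205.0 |
| mean | 174.049268 | 10.142537 |
| std | 12.337289 | 3.97204 |
| min | 141.1 | 7.0 |
| 25% | 166.3 | 8.6 |
| 50% | 173.2 | 9.0 |
| 75% | 183.1 | 9.4 |
| max | 208.1 | 23.0 |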

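The double brackets in `df[['length', 'compression-ratio']]` matter: a list of names selects a DataFrame, while a single name in single brackets selects a Series, and `describe()` returns a result of the matching shape. A minimal sketch, assuming a toy frame whose values are illustrative only:

```python
import pandas as pd

# Illustrative values; not taken from the dataset.
toy = pd.DataFrame({
    "length": [168.8, 171.2, 176.6],
    "compression-ratio": [9.0, 9.0, 10.0],
    "make": ["a", "b", "c"],
})

# Single brackets with one name: a Series, so describe() returns a Series.
one_column = toy["length"].describe()

# Double brackets (a list of names): a DataFrame, so describe() returns a DataFrame.
two_columns = toy[["length", "compression-ratio"]].describe()

print(type(one_column).__name__)
print(type(two_columns).__name__)
```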

In [9]:
# Print a concise summary of df: index range, columns, non-null counts, dtypes and memory usage
df.info()
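
`info` is a method, so it has to be called with parentheses; it prints its summary and returns `None`. When the text is needed as a string (for logging, say), a buffer can be passed through the `buf` parameter. The sketch below assumes a two-row toy frame; the name `toy` and its values are illustrative.

```python
import io

import pandas as pd

# Illustrative two-row frame.
toy = pd.DataFrame({"make": ["alfa-romero", "audi"], "engine-size": [130, 109]})

# Without parentheses, toy.info is only a reference to the method.
print(callable(toy.info))

# Calling it prints the summary to standard output and returns None.
result = toy.info()

# To capture the summary as text, write it to an in-memory buffer instead.
buffer = io.StringIO()
toy.info(buf=buffer)
summary_text = buffer.getvalue()
print(summary_text)
```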