<img src="Apache_Spark_logo.svg.png" height="300" width="300">

# Packages

In [2]:
import pandas as pd
pd.set_option("display.max_columns", 50)
import findspark
findspark.init()

In [23]:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, FloatType, IntegerType
from pyspark.sql.functions import countDistinct

In [17]:
from IPython.display import HTML, display
display(HTML("<style>pre { white-space: pre !important; }</style>"))

# Spark Session

In [4]:
spark = (SparkSession.builder
                     .appName("PysparkSession")
                     .getOrCreate())

# DataFrames

## Empty DataFrame

### Pandas

In [7]:
pd.DataFrame()

In [8]:
cols = ["col_1", "col_2", "col_3"]
df_pd = pd.DataFrame(columns = cols)
df_pd

Unnamed: 0,col_1,col_2,col_3
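The empty frame above has untyped columns. As a minimal sketch, assuming you want a pandas counterpart to the typed Spark schema used in the next subsection, each column can be declared from an empty `Series` with an explicit dtype. The name `df_typed` and the dtypes chosen are illustrative assumptions.

```python
import pandas as pd

# Declare each column with an explicit dtype so the empty frame is typed.
# The dtypes are illustrative choices, not part of the original notebook.
df_typed = pd.DataFrame({
    "col_1": pd.Series(dtype="object"),   # strings
    "col_2": pd.Series(dtype="float32"),  # 32-bit floats
    "col_3": pd.Series(dtype="int32"),    # 32-bit integers
})

print(df_typed.dtypes)
print(len(df_typed))  # the frame has columns but no rows
```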


### Pyspark

In [13]:
df_spark = spark.createDataFrame(df_pd, schema = StructType([]))
df_spark.show()

++
||
++
++



In [14]:
df_spark = spark.createDataFrame(data = [], schema = StructType([]))
df_spark.show()

++
||
++
++



In [15]:
schema = StructType([
    StructField("col_1", StringType(), True),
    StructField("col_2", FloatType(), True),
    StructField("col_3", IntegerType(), True),
])

df_spark = spark.createDataFrame(data = [], schema = schema)
df_spark.show()

+-----+-----+-----+
|col_1|col_2|col_3|
+-----+-----+-----+
+-----+-----+-----+



## Count Distinct

In [18]:
# Parquet files carry their own schema, so no "header" option is needed.
df_auto = spark.read.format("parquet")\
                    .load("./datasets/Automobile_data.parquet")
df_auto.show(5)

+---------+-----------------+-----------+---------+----------+------------+-----------+------------+---------------+----------+------+-----+------+-----------+-----------+----------------+-----------+-----------+----+------+-----------------+----------+--------+--------+-----------+-------+
|symboling|normalized-losses|       make|fuel-type|aspiration|num-of-doors| body-style|drive-wheels|engine-location|wheel-base|length|width|height|curb-weight|engine-type|num-of-cylinders|engine-size|fuel-system|bore|stroke|compression-ratio|horsepower|peak-rpm|city-mpg|highway-mpg|  price|
+---------+-----------------+-----------+---------+----------+------------+-----------+------------+---------------+----------+------+-----+------+-----------+-----------+----------------+-----------+-----------+----+------+-----------------+----------+--------+--------+-----------+-------+
|        3|             null|alfa-romero|      gas|       std|         two|convertible|         rwd|          front|      88

In [22]:
df_auto.select("make").distinct().count()

22

In [26]:
df_auto.select(countDistinct("make").alias("count_distinct_make")).show()
df_auto.select(countDistinct("make").alias("count_distinct_make")).collect()[0]["count_distinct_make"]

+-------------------+
|count_distinct_make|
+-------------------+
|                 22|
+-------------------+



22
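For comparison, a minimal pandas sketch of the same count. The sample data is hypothetical. Note that `nunique()` skips missing values by default, as `countDistinct` does, whereas `select(...).distinct().count()` counts a null as one more distinct row.

```python
import pandas as pd

# Hypothetical sample; the notebook itself reads Automobile_data.parquet.
df_sample = pd.DataFrame({"make": ["audi", "bmw", "audi", "volvo", None]})

# nunique() ignores missing values by default.
count_distinct_make = df_sample["make"].nunique()

# dropna=False counts the missing value as one more distinct entry.
count_with_missing = df_sample["make"].nunique(dropna=False)

print(count_distinct_make, count_with_missing)
```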

# Pyspark vs Pandas

Data growth is inevitable and has become evident in recent years, which increases the time needed to process that data. Tools such as `Python` with `Pandas` can analyze and work with it, but `Pandas` loads the whole DataFrame into memory, so with large volumes of data a single machine's memory cannot hold that much information. Tools such as `Apache Spark` use distributed computing to process large volumes of data at high speed. Keep in mind, though, that `Apache Spark` is not a programming language and can be limited in some respects.
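To make the in-memory cost concrete, the following minimal sketch measures how many bytes a pandas DataFrame occupies once loaded. The sample data and the name `df_demo` are hypothetical; `memory_usage(deep=True)` also counts the contents of string columns.

```python
import pandas as pd

# Hypothetical data: 3,000 rows with one string and one float column.
df_demo = pd.DataFrame({
    "make": ["audi", "bmw", "volvo"] * 1000,
    "price": [13950.0, 16430.0, 12940.0] * 1000,
})

# Sum the per-column footprint (index included) to get the total in bytes.
bytes_in_memory = int(df_demo.memory_usage(deep=True).sum())
print(f"{bytes_in_memory / 1024**2:.2f} MiB")
```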

## When to use Pyspark instead of Pandas

Knowing when to use one tool or the other can be difficult, to the point that some data scientists assume `Pyspark` is always better than `Pandas`. That is not entirely true. The following comparison shows how many GB can be loaded with `Pandas` and with `Pyspark`:

<img src="sp_1JPG.jpg" height="600" width="600">

**Note:** the test was run on the following hardware:

- **CPU cores:** 32 virtual cores (16 physical cores), Intel Xeon E5–2686 v4 CPU @ 2.30GHz
- **System memory:** 244 GB
- **Local disk:** 4 NVMe SSDs of 1900 GB

In theory, `Pyspark` can be up to 100 times faster than `Pandas`. In this test, however, both have the same execution time up to a certain number of GB. Beyond that limit their execution times diverge, because `Pandas` is limited by memory. The following chart shows the comparison:

<img src="sp_2.jpg" height="950" width="950">

In the initial phase, up to 20 GB, both curves have the same slope, but as the file size grows, `Pandas` runs out of memory while `PySpark` completes the job successfully.

Although `Pyspark` turned out to be faster than `Pandas` for large volumes of data, for small datasets of 10-12 GB you may prefer `Pandas` over `PySpark`, given the same execution time and lower complexity. Above that size, you should work with `PySpark`.
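The rule of thumb above can be written as a small helper. This is only a sketch: `choose_engine` is a hypothetical function, and the 12 GB default comes from the benchmark on the hardware listed earlier, so the right threshold on another machine is an assumption you would need to check.

```python
def choose_engine(size_gb: float, threshold_gb: float = 12.0) -> str:
    """Return "pandas" for data at or below the threshold, else "pyspark".

    The default threshold mirrors the 10-12 GB range quoted in the text
    and is hardware dependent.
    """
    if size_gb < 0:
        raise ValueError("size_gb must be non-negative")
    return "pandas" if size_gb <= threshold_gb else "pyspark"


print(choose_engine(8))   # small dataset
print(choose_engine(50))  # large dataset
```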

# Artificial Intelligence

<br>
<img src="IA.jpg" height = "750" width = "750">
<br>

## Machine Learning

<img src="MLearning.jpg" height = "500" width = "500">

Machine Learning is a branch of Artificial Intelligence that uses algorithms to give "machines" the ability to identify patterns in data (that is, to "learn") and make predictions. This learning lets machines perform specific tasks such as classification, forecasting, and recommendation, among others.

ML can be used for:

- Problems whose existing solutions require a large amount of work.
- Complex problems that have no solution with traditional methods.
- Fluctuating environments.
- Gaining insight into complex problems and large amounts of data.
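As a minimal illustration of "learning" a pattern from data and then predicting, the sketch below fits a straight line by ordinary least squares. The `fit_line` helper and the data points are hypothetical and chosen so the result is easy to check by hand.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical training data generated by y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(xs, ys)
prediction = slope * 5.0 + intercept  # predict for an unseen input
print(slope, intercept, prediction)
```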

Python and R are among the most popular programming languages for data analysis, machine learning, AI, and deep learning.

<img src="R.jpg" height = "300" width = "300">

## Types of algorithms in ML
Machine learning systems are classified here according to the relationship between the input and output data, and according to whether the data is continuous or discrete. Some of them are shown below:
<br>
<br>
<img src="ML.png" height = "800" width = "800">
<br>
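The classification criteria can be sketched as a rough heuristic: no output data suggests an unsupervised task, discrete outputs suggest classification, and continuous outputs suggest regression. The `task_type` helper below is hypothetical and deliberately simplistic; real projects decide this from the problem, not only from the Python types.

```python
def task_type(targets):
    """Guess the kind of learning task from its target values.

    None               -> "unsupervised" (no output data is given)
    only str/bool/int  -> "classification" (discrete output)
    anything else      -> "regression" (continuous output)
    """
    if targets is None:
        return "unsupervised"
    if all(isinstance(t, (str, bool, int)) for t in targets):
        return "classification"
    return "regression"


print(task_type(None))
print(task_type(["gas", "diesel", "gas"]))
print(task_type([13495.0, 16500.0, 13950.0]))
```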

## ML Build

<img src="ml_2.jfif" height = "650" width = "650">

# ML end-to-end
<br>
<br>
<img src="MLend.jpg" height = "750" width = "750">

# EDA

## Pyspark

In [27]:
# Parquet files carry their own schema, so no "header" option is needed.
df_auto = spark.read.format("parquet")\
                    .load("./datasets/Automobile_data.parquet")
df_auto.show(5)

+---------+-----------------+-----------+---------+----------+------------+-----------+------------+---------------+----------+------+-----+------+-----------+-----------+----------------+-----------+-----------+----+------+-----------------+----------+--------+--------+-----------+-------+
|symboling|normalized-losses|       make|fuel-type|aspiration|num-of-doors| body-style|drive-wheels|engine-location|wheel-base|length|width|height|curb-weight|engine-type|num-of-cylinders|engine-size|fuel-system|bore|stroke|compression-ratio|horsepower|peak-rpm|city-mpg|highway-mpg|  price|
+---------+-----------------+-----------+---------+----------+------------+-----------+------------+---------------+----------+------+-----+------+-----------+-----------+----------------+-----------+-----------+----+------+-----------------+----------+--------+--------+-----------+-------+
|        3|             null|alfa-romero|      gas|       std|         two|convertible|         rwd|          front|      88

In [45]:
from pyspark.sql.functions import col, rand, isnan, when, count, avg
from pyspark.ml.stat import Correlation

### Dtypes

In [30]:
df_auto.printSchema()

root
 |-- symboling: long (nullable = true)
 |-- normalized-losses: double (nullable = true)
 |-- make: string (nullable = true)
 |-- fuel-type: string (nullable = true)
 |-- aspiration: string (nullable = true)
 |-- num-of-doors: string (nullable = true)
 |-- body-style: string (nullable = true)
 |-- drive-wheels: string (nullable = true)
 |-- engine-location: string (nullable = true)
 |-- wheel-base: double (nullable = true)
 |-- length: double (nullable = true)
 |-- width: double (nullable = true)
 |-- height: double (nullable = true)
 |-- curb-weight: long (nullable = true)
 |-- engine-type: string (nullable = true)
 |-- num-of-cylinders: string (nullable = true)
 |-- engine-size: long (nullable = true)
 |-- fuel-system: string (nullable = true)
 |-- bore: double (nullable = true)
 |-- stroke: double (nullable = true)
 |-- compression-ratio: double (nullable = true)
 |-- horsepower: double (nullable = true)
 |-- peak-rpm: double (nullable = true)
 |-- city-mpg: long (nullable = tru

### Describe

In [32]:
df_auto.describe().show()

+-------+------------------+-----------------+-----------+---------+----------+------------+-----------+------------+---------------+------------------+------------------+-----------------+------------------+------------------+-----------+----------------+------------------+-----------+------------------+------------------+------------------+------------------+-----------------+-----------------+-----------------+------------------+
|summary|         symboling|normalized-losses|       make|fuel-type|aspiration|num-of-doors| body-style|drive-wheels|engine-location|        wheel-base|            length|            width|            height|       curb-weight|engine-type|num-of-cylinders|       engine-size|fuel-system|              bore|            stroke| compression-ratio|        horsepower|         peak-rpm|         city-mpg|      highway-mpg|             price|
+-------+------------------+-----------------+-----------+---------+----------+------------+-----------+------------+---------

In [33]:
df_auto.toPandas().describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
symboling,205.0,0.834146,1.245307,-2.0,0.0,1.0,2.0,3.0
normalized-losses,164.0,122.0,35.442168,65.0,94.0,115.0,150.0,256.0
wheel-base,205.0,98.756585,6.021776,86.6,94.5,97.0,102.4,120.9
length,205.0,174.049268,12.337289,141.1,166.3,173.2,183.1,208.1
width,205.0,65.907805,2.145204,60.3,64.1,65.5,66.9,72.3
height,205.0,53.724878,2.443522,47.8,52.0,54.1,55.5,59.8
curb-weight,205.0,2555.565854,520.680204,1488.0,2145.0,2414.0,2935.0,4066.0
engine-size,205.0,126.907317,41.642693,61.0,97.0,120.0,141.0,326.0
bore,201.0,3.329751,0.273539,2.54,3.15,3.31,3.59,3.94
stroke,201.0,3.255423,0.316717,2.07,3.11,3.29,3.41,4.17


In [35]:
df_auto.describe(["horsepower"]).show()

+-------+------------------+
|summary|        horsepower|
+-------+------------------+
|  count|               203|
|   mean|104.25615763546799|
| stddev| 39.71436878679357|
|    min|              48.0|
|    max|             288.0|
+-------+------------------+



### Handling Missing Values

In [41]:
df_auto.select([
    count(when(col(column).isNull(), column)).alias(column) for column in df_auto.columns
]).show()

+---------+-----------------+----+---------+----------+------------+----------+------------+---------------+----------+------+-----+------+-----------+-----------+----------------+-----------+-----------+----+------+-----------------+----------+--------+--------+-----------+-----+
|symboling|normalized-losses|make|fuel-type|aspiration|num-of-doors|body-style|drive-wheels|engine-location|wheel-base|length|width|height|curb-weight|engine-type|num-of-cylinders|engine-size|fuel-system|bore|stroke|compression-ratio|horsepower|peak-rpm|city-mpg|highway-mpg|price|
+---------+-----------------+----+---------+----------+------------+----------+------------+---------------+----------+------+-----+------+-----------+-----------+----------------+-----------+-----------+----+------+-----------------+----------+--------+--------+-----------+-----+
|        0|               41|   0|        0|         0|           2|         0|           0|              0|         0|     0|    0|     0|          0|   

In [42]:
df_auto.describe(["normalized-losses"]).show()

+-------+-----------------+
|summary|normalized-losses|
+-------+-----------------+
|  count|              164|
|   mean|            122.0|
| stddev|35.44216753055326|
|    min|             65.0|
|    max|            256.0|
+-------+-----------------+



#### Fillna

In [49]:
media = df_auto.select(avg("normalized-losses").alias("mean")).collect()[0]["mean"]
df_auto = df_auto.na.fill(value = media, subset = "normalized-losses")

In [50]:
df_auto.select([
    count(when(col(column).isNull(), column)).alias(column) for column in df_auto.columns
]).show()

+---------+-----------------+----+---------+----------+------------+----------+------------+---------------+----------+------+-----+------+-----------+-----------+----------------+-----------+-----------+----+------+-----------------+----------+--------+--------+-----------+-----+
|symboling|normalized-losses|make|fuel-type|aspiration|num-of-doors|body-style|drive-wheels|engine-location|wheel-base|length|width|height|curb-weight|engine-type|num-of-cylinders|engine-size|fuel-system|bore|stroke|compression-ratio|horsepower|peak-rpm|city-mpg|highway-mpg|price|
+---------+-----------------+----+---------+----------+------------+----------+------------+---------------+----------+------+-----+------+-----------+-----------+----------------+-----------+-----------+----+------+-----------------+----------+--------+--------+-----------+-----+
|        0|                0|   0|        0|         0|           2|         0|           0|              0|         0|     0|    0|     0|          0|   

#### Dropna

In [65]:
df_auto = df_auto.na.drop()
df_auto.select([
    count(when(col(column).isNull(), column)).alias(column) for column in df_auto.columns
]).show()

+---------+-----------------+----+---------+----------+------------+----------+------------+---------------+----------+------+-----+------+-----------+-----------+----------------+-----------+-----------+----+------+-----------------+----------+--------+--------+-----------+-----+
|symboling|normalized-losses|make|fuel-type|aspiration|num-of-doors|body-style|drive-wheels|engine-location|wheel-base|length|width|height|curb-weight|engine-type|num-of-cylinders|engine-size|fuel-system|bore|stroke|compression-ratio|horsepower|peak-rpm|city-mpg|highway-mpg|price|
+---------+-----------------+----+---------+----------+------------+----------+------------+---------------+----------+------+-----+------+-----------+-----------+----------------+-----------+-----------+----+------+-----------------+----------+--------+--------+-----------+-----+
|        0|                0|   0|        0|         0|           0|         0|           0|              0|         0|     0|    0|     0|          0|   

In [67]:
df_auto.show()

+---------+-----------------+-----------+---------+----------+------------+-----------+------------+---------------+----------+------+-----+------+-----------+-----------+----------------+-----------+-----------+----+------+-----------------+----------+--------+--------+-----------+-------+
|symboling|normalized-losses|       make|fuel-type|aspiration|num-of-doors| body-style|drive-wheels|engine-location|wheel-base|length|width|height|curb-weight|engine-type|num-of-cylinders|engine-size|fuel-system|bore|stroke|compression-ratio|horsepower|peak-rpm|city-mpg|highway-mpg|  price|
+---------+-----------------+-----------+---------+----------+------------+-----------+------------+---------------+----------+------+-----+------+-----------+-----------+----------------+-----------+-----------+----+------+-----------------+----------+--------+--------+-----------+-------+
|        3|            122.0|alfa-romero|      gas|       std|         two|convertible|         rwd|          front|      88

### Statistics

In [69]:
df_auto.stat.corr("engine-size", "horsepower")

0.8453249175361403
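`stat.corr` computes the Pearson correlation coefficient by default. For two columns $x$ and $y$ with $n$ rows, it is defined as:

$$
r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
$$

The coefficient ranges from -1 to 1, so a value of about 0.85 points to a strong positive linear relationship between `engine-size` and `horsepower`.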

## Pandas

In [52]:
df_auto_pd = pd.read_parquet("./datasets/Automobile_data.parquet")
df_auto_pd.head(3)

Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,length,width,height,curb-weight,engine-type,num-of-cylinders,engine-size,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
0,3,,alfa-romero,gas,std,two,convertible,rwd,front,88.6,168.8,64.1,48.8,2548,dohc,four,130,mpfi,3.47,2.68,9.0,111.0,5000.0,21,27,13495.0
1,3,,alfa-romero,gas,std,two,convertible,rwd,front,88.6,168.8,64.1,48.8,2548,dohc,four,130,mpfi,3.47,2.68,9.0,111.0,5000.0,21,27,16500.0
2,1,,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,171.2,65.5,52.4,2823,ohcv,six,152,mpfi,2.68,3.47,9.0,154.0,5000.0,19,26,16500.0


### dtypes

In [56]:
pd.DataFrame(df_auto_pd.dtypes).T

Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,length,width,height,curb-weight,engine-type,num-of-cylinders,engine-size,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
0,int64,float64,object,object,object,object,object,object,object,float64,float64,float64,float64,int64,object,object,int64,object,float64,float64,float64,float64,float64,int64,int64,float64


### Describe

In [None]:
df_describe = df_auto_pd.describe().T
df_describe["CV"] = (df_describe["std"]/abs(df_describe["mean"]))*100
df_describe
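The `CV` column is the coefficient of variation: the standard deviation as a percentage of the absolute mean. A small worked sketch follows, using hypothetical values in place of the real dataset. `describe()` reports the sample standard deviation, and the ratio is undefined for a column whose mean is 0.

```python
import pandas as pd

# Hypothetical sample standing in for two numeric columns of the dataset.
df_demo = pd.DataFrame({
    "engine-size": [130, 152, 109, 136],
    "horsepower": [111.0, 154.0, 102.0, 115.0],
})

df_describe = df_demo.describe().T
# CV = sample standard deviation relative to the absolute mean, in percent.
df_describe["CV"] = (df_describe["std"] / df_describe["mean"].abs()) * 100
print(df_describe[["mean", "std", "CV"]])
```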

### Handling Missing Values

In [63]:
pd.DataFrame(df_auto_pd.isnull().sum()).T

Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,length,width,height,curb-weight,engine-type,num-of-cylinders,engine-size,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
0,0,41,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,4,4,0,2,2,0,0,4


#### Fillna

In [64]:
df_auto_pd["normalized-losses"] = df_auto_pd["normalized-losses"].fillna(df_auto_pd["normalized-losses"].mean())
pd.DataFrame(df_auto_pd.isnull().sum()).T

Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,length,width,height,curb-weight,engine-type,num-of-cylinders,engine-size,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,4,4,0,2,2,0,0,4


#### Dropna

In [66]:
df_auto_pd = df_auto_pd.dropna()
pd.DataFrame(df_auto_pd.isnull().sum()).T

Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,length,width,height,curb-weight,engine-type,num-of-cylinders,engine-size,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


# Build ML

In [71]:
df_auto.show(5)

+---------+-----------------+-----------+---------+----------+------------+-----------+------------+---------------+----------+------+-----+------+-----------+-----------+----------------+-----------+-----------+----+------+-----------------+----------+--------+--------+-----------+-------+
|symboling|normalized-losses|       make|fuel-type|aspiration|num-of-doors| body-style|drive-wheels|engine-location|wheel-base|length|width|height|curb-weight|engine-type|num-of-cylinders|engine-size|fuel-system|bore|stroke|compression-ratio|horsepower|peak-rpm|city-mpg|highway-mpg|  price|
+---------+-----------------+-----------+---------+----------+------------+-----------+------------+---------------+----------+------+-----+------+-----------+-----------+----------------+-----------+-----------+----+------+-----------------+----------+--------+--------+-----------+-------+
|        3|            122.0|alfa-romero|      gas|       std|         two|convertible|         rwd|          front|      88

## Featurization

Featurization covers common tasks such as handling categorical features, normalization, imputing missing data, and building a pipeline of featurization steps.

- **Transformer:** Applies a transformation to a column (for example, encoding categorical data).
- **Estimator:** An algorithm that is fit on data to train a model.
- **Pipeline:** Runs the steps for normalization, featurization, and model building in sequence.

<img src="pipeline.jpg" height="750" width="750">

### Pyspark

In [77]:
from pyspark.ml.feature import OneHotEncoder, StringIndexer

In [80]:
df_make = df_auto.select("make").distinct()
df_make.show()

+-------------+
|         make|
+-------------+
|       peugot|
|       jaguar|
|   mitsubishi|
|       toyota|
|         saab|
|     plymouth|
|         audi|
|  alfa-romero|
|          bmw|
|        dodge|
|        mazda|
|mercedes-benz|
|        isuzu|
|      porsche|
|    chevrolet|
|        honda|
|   volkswagen|
|      mercury|
|       nissan|
|       subaru|
+-------------+
only showing top 20 rows



In [91]:
indexer = StringIndexer(inputCol = "make", outputCol = "make_index")
indexerModel = indexer.fit(df_make)
index_df = indexerModel.transform(df_make)
index_df.show(25)

+-------------+----------+
|         make|make_index|
+-------------+----------+
|       peugot|      13.0|
|       jaguar|       7.0|
|   mitsubishi|      11.0|
|       toyota|      18.0|
|         saab|      16.0|
|     plymouth|      14.0|
|         audi|       1.0|
|  alfa-romero|       0.0|
|          bmw|       2.0|
|        dodge|       4.0|
|        mazda|       8.0|
|mercedes-benz|       9.0|
|        isuzu|       6.0|
|      porsche|      15.0|
|    chevrolet|       3.0|
|        honda|       5.0|
|   volkswagen|      19.0|
|      mercury|      10.0|
|       nissan|      12.0|
|       subaru|      17.0|
|        volvo|      20.0|
+-------------+----------+



In [86]:
# encoder = OneHotEncoder(inputCol = "make_index", outputCol = "make_encoder")
# encoderModel = encoder.fit(index_df)
# encoder_df = encoderModel.transform(index_df)
# encoder_df.show(21)

### Pandas

In [88]:
from sklearn.preprocessing import LabelEncoder

In [87]:
df_auto_pd.head()

Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,length,width,height,curb-weight,engine-type,num-of-cylinders,engine-size,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
0,3,122.0,alfa-romero,gas,std,two,convertible,rwd,front,88.6,168.8,64.1,48.8,2548,dohc,four,130,mpfi,3.47,2.68,9.0,111.0,5000.0,21,27,13495.0
1,3,122.0,alfa-romero,gas,std,two,convertible,rwd,front,88.6,168.8,64.1,48.8,2548,dohc,four,130,mpfi,3.47,2.68,9.0,111.0,5000.0,21,27,16500.0
2,1,122.0,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,171.2,65.5,52.4,2823,ohcv,six,152,mpfi,2.68,3.47,9.0,154.0,5000.0,19,26,16500.0
3,2,164.0,audi,gas,std,four,sedan,fwd,front,99.8,176.6,66.2,54.3,2337,ohc,four,109,mpfi,3.19,3.4,10.0,102.0,5500.0,24,30,13950.0
4,2,164.0,audi,gas,std,four,sedan,4wd,front,99.4,176.6,66.4,54.3,2824,ohc,five,136,mpfi,3.19,3.4,8.0,115.0,5500.0,18,22,17450.0


In [90]:
labelencoder = LabelEncoder()
df_cat = df_auto_pd.select_dtypes(include = "O")
df_cat = df_cat.apply(labelencoder.fit_transform)
df_cat

Unnamed: 0,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,engine-type,num-of-cylinders,fuel-system
0,0,1,0,1,0,2,0,0,2,4
1,0,1,0,1,0,2,0,0,2,4
2,0,1,0,1,2,2,0,4,3,4
3,1,1,0,0,3,1,0,2,2,4
4,1,1,0,0,3,0,0,2,1,4
...,...,...,...,...,...,...,...,...,...,...
200,20,1,0,0,3,2,0,2,2,4
201,20,1,1,0,3,2,0,2,2,4
202,20,1,0,0,3,2,0,4,3,4
203,20,0,1,0,3,2,0,2,3,2
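Because the single `LabelEncoder` instance above is refit on every column, only the classes of the last column remain available afterwards. A minimal sketch of one alternative, swapping in pandas categorical codes instead of `LabelEncoder`, keeps a code-to-label mapping per column. The sample data and the names `df_codes` and `mappings` are hypothetical. Both approaches sort the labels, so the integer codes coincide.

```python
import pandas as pd

# Hypothetical sample with two categorical columns.
df_demo = pd.DataFrame({
    "make": ["audi", "bmw", "audi", "volvo"],
    "fuel-type": ["gas", "diesel", "gas", "gas"],
})

codes = {}
mappings = {}
for column in df_demo.columns:
    categorical = df_demo[column].astype("category")  # categories are sorted
    codes[column] = categorical.cat.codes              # integer code per row
    mappings[column] = dict(enumerate(categorical.cat.categories))

df_codes = pd.DataFrame(codes)
print(df_codes)
print(mappings["make"])
```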


In [92]:
df_num = df_auto_pd.select_dtypes(exclude = "O")
df_num

Unnamed: 0,symboling,normalized-losses,wheel-base,length,width,height,curb-weight,engine-size,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
0,3,122.0,88.6,168.8,64.1,48.8,2548,130,3.47,2.68,9.0,111.0,5000.0,21,27,13495.0
1,3,122.0,88.6,168.8,64.1,48.8,2548,130,3.47,2.68,9.0,111.0,5000.0,21,27,16500.0
2,1,122.0,94.5,171.2,65.5,52.4,2823,152,2.68,3.47,9.0,154.0,5000.0,19,26,16500.0
3,2,164.0,99.8,176.6,66.2,54.3,2337,109,3.19,3.40,10.0,102.0,5500.0,24,30,13950.0
4,2,164.0,99.4,176.6,66.4,54.3,2824,136,3.19,3.40,8.0,115.0,5500.0,18,22,17450.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
200,-1,95.0,109.1,188.8,68.9,55.5,2952,141,3.78,3.15,9.5,114.0,5400.0,23,28,16845.0
201,-1,95.0,109.1,188.8,68.8,55.5,3049,141,3.78,3.15,8.7,160.0,5300.0,19,25,19045.0
202,-1,95.0,109.1,188.8,68.9,55.5,3012,173,3.58,2.87,8.8,134.0,5500.0,18,23,21485.0
203,-1,95.0,109.1,188.8,68.9,55.5,3217,145,3.01,3.40,23.0,106.0,4800.0,26,27,22470.0


In [93]:
df_end = pd.concat([df_num, df_cat], axis = 1)
df_end

Unnamed: 0,symboling,normalized-losses,wheel-base,length,width,height,curb-weight,engine-size,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,engine-type,num-of-cylinders,fuel-system
0,3,122.0,88.6,168.8,64.1,48.8,2548,130,3.47,2.68,9.0,111.0,5000.0,21,27,13495.0,0,1,0,1,0,2,0,0,2,4
1,3,122.0,88.6,168.8,64.1,48.8,2548,130,3.47,2.68,9.0,111.0,5000.0,21,27,16500.0,0,1,0,1,0,2,0,0,2,4
2,1,122.0,94.5,171.2,65.5,52.4,2823,152,2.68,3.47,9.0,154.0,5000.0,19,26,16500.0,0,1,0,1,2,2,0,4,3,4
3,2,164.0,99.8,176.6,66.2,54.3,2337,109,3.19,3.40,10.0,102.0,5500.0,24,30,13950.0,1,1,0,0,3,1,0,2,2,4
4,2,164.0,99.4,176.6,66.4,54.3,2824,136,3.19,3.40,8.0,115.0,5500.0,18,22,17450.0,1,1,0,0,3,0,0,2,1,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
200,-1,95.0,109.1,188.8,68.9,55.5,2952,141,3.78,3.15,9.5,114.0,5400.0,23,28,16845.0,20,1,0,0,3,2,0,2,2,4
201,-1,95.0,109.1,188.8,68.8,55.5,3049,141,3.78,3.15,8.7,160.0,5300.0,19,25,19045.0,20,1,1,0,3,2,0,2,2,4
202,-1,95.0,109.1,188.8,68.9,55.5,3012,173,3.58,2.87,8.8,134.0,5500.0,18,23,21485.0,20,1,0,0,3,2,0,4,3,4
203,-1,95.0,109.1,188.8,68.9,55.5,3217,145,3.01,3.40,23.0,106.0,4800.0,26,27,22470.0,20,0,1,0,3,2,0,2,3,2
