The following categorical variables need to be handled:

```python
['Census_ProcessorClass',
 'Census_PrimaryDiskTypeName',
 'Census_ChassisTypeName',
 'Census_PowerPlatformRoleName',
 'Census_InternalBatteryType',
 'Census_OSVersion',
 'Census_OSArchitecture',
 'Census_OSBranch',
 'Census_OSEdition',
 'Census_OSSkuName',
 'Census_OSInstallTypeName',
 'Census_OSWUAutoUpdateOptionsName',
 'Census_GenuineStateName',
 'Census_ActivationChannel',
 'Census_FlightRing']
```

# Imports

In [2]:
from pyspark import SparkConf,SparkContext
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, IntegerType, LongType
from pyspark.sql.functions import col, when

from pyspark.ml.feature import StringIndexer

import multiprocessing

# Configuration

In [None]:
cores = multiprocessing.cpu_count()
p = 10
conf = SparkConf()
conf.set("spark.driver.cores", cores)
conf.set("spark.driver.memory", "10g")
conf.set("spark.sql.shuffle.partitions", p * cores)
conf.set("spark.default.parallelism", p * cores)
sc = SparkContext(conf=conf)

# SparkSession

In [3]:
spark = SparkSession.builder.appName("Microsoft_Kaggle").getOrCreate()

# Data

In [8]:
data = spark.read.csv("data/train.csv", header=True, inferSchema=True)

## ProductName
Label encoding for categorical variables.

In [None]:
indexer = StringIndexer(inputCol="ProductName", outputCol="ProductNameIndex")
data = indexer.fit(data).transform(data)
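
To make the label encoding concrete, here is a minimal plain-Python sketch of the mapping that `StringIndexer` is expected to produce, assuming its default ordering (`stringOrderType="frequencyDesc"`): the most frequent label gets index 0.0, the next one 1.0, and so on. The helper `label_encode` and the sample values are hypothetical and only illustrate the idea; they are not part of Spark or of the dataset.

```python
from collections import Counter

def label_encode(values):
    """Map each label to an index ordered by descending frequency."""
    counts = Counter(values)
    # Most frequent label first; ties are broken alphabetically (assumption).
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    mapping = {label: float(i) for i, label in enumerate(ordered)}
    return mapping, [mapping[v] for v in values]

# Hypothetical sample values, not taken from the dataset.
sample = ["prod_a", "prod_b", "prod_a", "prod_a", "prod_b", "prod_c"]
mapping, encoded = label_encode(sample)
print(mapping)
print(encoded)
```

The resulting indices carry no meaning beyond frequency rank, so a model may read an ordering into them that the categories do not have.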

## Census_PrimaryDiskTypeName
Label encoding for Census_PrimaryDiskTypeName. Missing values are first replaced with 'UNKNOWN'.

In [None]:
data = data.fillna({'Census_PrimaryDiskTypeName': 'UNKNOWN'})
indexer = StringIndexer(inputCol="Census_PrimaryDiskTypeName", outputCol="Census_PrimaryDiskTypeNameIndex")
data = indexer.fit(data).transform(data)

## Census_ChassisTypeName
Frequency encoding: each category is replaced by the number of rows in which it appears.

In [None]:
frequency_census = data.groupBy('Census_ChassisTypeName').count().withColumnRenamed('count','Census_ChassisTypeName_freq')
data = data.join(frequency_census,'Census_ChassisTypeName','left')
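
As an illustration of what the group-and-join above computes, here is a small plain-Python sketch of frequency encoding. The helper `frequency_encode` and the sample values are hypothetical. It assumes that missing values end up without a frequency, which is how an equality join in Spark should behave, since null keys do not match each other.

```python
from collections import Counter

def frequency_encode(values):
    """Return, for each value, how many times it occurs in the column."""
    counts = Counter(v for v in values if v is not None)
    # Missing values get None, mirroring an equality join that never matches nulls.
    return [counts[v] if v is not None else None for v in values]

# Hypothetical sample values, not taken from the dataset.
sample = ["Notebook", "Desktop", "Notebook", None, "Laptop", "Notebook"]
freq = frequency_encode(sample)
print(freq)
```

If missing values should receive their own count instead, one option is to fill them with a placeholder such as 'UNKNOWN' before grouping, as is done for other columns in this notebook.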

## Census_PowerPlatformRoleName

Label encoding for Census_PowerPlatformRoleName.

In [13]:
data.groupBy("Census_PowerPlatformRoleName").count().show(5000,False)

+----------------------------+-------+
|Census_PowerPlatformRoleName|count  |
+----------------------------+-------+
|Unspecified                 |5      |
|UNKNOWN                     |20683  |
|SOHOServer                  |37841  |
|AppliancePC                 |4015   |
|Workstation                 |109683 |
|Slate                       |492537 |
|Mobile                      |6182908|
|EnterpriseServer            |7094   |
|Desktop                     |2066620|
|PerformanceServer           |97     |
+----------------------------+-------+



In [12]:
data = data.fillna({'Census_PowerPlatformRoleName': 'UNKNOWN'})
indexer = StringIndexer(inputCol="Census_PowerPlatformRoleName", outputCol="Census_PowerPlatformRoleNameIndex")
data = indexer.fit(data).transform(data)

## Census_InternalBatteryType
Frequency encoding plus a boolean indicator of whether the value is present.

In [19]:
data.groupBy("Census_InternalBatteryType").count().orderBy('count',ascending=False).show(100,False)

+--------------------------+-------+
|Census_InternalBatteryType|count  |
+--------------------------+-------+
|null                      |6338414|
|lion                      |2028256|
|li-i                      |245617 |
|#                         |183998 |
|lip                       |62099  |
|liio                      |32635  |
|li p                      |8383   |
|li                        |6708   |
|nimh                      |4614   |
|real                      |2744   |
|bq20                      |2302   |
|pbac                      |2274   |
|vbox                      |1454   |
|unkn                      |533    |
|lgi0                      |399    |
|lipo                      |198    |
|lhp0                      |182    |
|4cel                      |170    |
|lipp                      |83     |
|ithi                      |79     |
|batt                      |60     |
|ram                       |35     |
|virt                      |33     |
|bad                       |33     |
(output truncated)

### Frequency

In [25]:
frequency_census = data.groupBy('Census_InternalBatteryType').count().withColumnRenamed('count','Census_InternalBatteryType_freq')
data = data.join(frequency_census,'Census_InternalBatteryType','left')

### Boolean

In [26]:
data = data.withColumn('Census_InternalBatteryType_informed',when(col('Census_InternalBatteryType').isNotNull(),1).otherwise(0))
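
The indicator column can be pictured with the following plain-Python sketch. The helper `informed_flag` is hypothetical and the sample reuses a few battery types from the table above; it only mirrors the `isNotNull` check, so placeholder-like values such as `#` or `unkn` still count as informed.

```python
def informed_flag(values):
    """Return 1 where a value is present and 0 where it is missing."""
    return [0 if v is None else 1 for v in values]

# A few battery types from the table above, with None standing in for null.
sample = ["lion", None, "li-i", None, "#"]
flags = informed_flag(sample)
print(flags)
```

Because most rows have a null battery type, this flag may carry more signal than the exact battery label.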