The following categorical variables need to be handled:

```python
['Census_ProcessorClass',
 'Census_PrimaryDiskTypeName',
 'Census_ChassisTypeName',
 'Census_PowerPlatformRoleName',
 'Census_InternalBatteryType',
 'Census_OSVersion',
 'Census_OSArchitecture',
 'Census_OSBranch',
 'Census_OSEdition',
 'Census_OSSkuName',
 'Census_OSInstallTypeName',
 'Census_OSWUAutoUpdateOptionsName',
 'Census_GenuineStateName',
 'Census_ActivationChannel',
 'Census_FlightRing']
```

# Index
1. [Configuration](#Configuration)
2. [ProductName](#ProductName)


# Imports

In [1]:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, IntegerType, LongType
from pyspark.sql.functions import col, split, when  # explicit names avoid shadowing builtins such as sum, min and max

from pyspark.ml.feature import StringIndexer

import multiprocessing

# Configuration

In [None]:
cores = multiprocessing.cpu_count()
p = 10
conf = SparkConf()
conf.set("spark.driver.cores", cores)
conf.set("spark.driver.memory", "10g")
conf.set("spark.sql.shuffle.partitions", p * cores)
conf.set("spark.default.parallelism", p * cores)
sc = SparkContext(conf=conf)

# SparkSession

In [2]:
spark = SparkSession.builder.appName("Microsoft_Kaggle").getOrCreate()

# Data

In [3]:
data = spark.read.csv("data/train.csv", header=True, inferSchema=True)

## ProductName
Label encoding for categorical variables.

In [None]:
indexer = StringIndexer(inputCol="ProductName", outputCol="ProductNameIndex")
data = indexer.fit(data).transform(data)

## Census_PrimaryDiskTypeName
Label encoding for Census_PrimaryDiskTypeName.

In [None]:
data = data.fillna( { 'Census_PrimaryDiskTypeName':'UNKNOWN'} )
indexer = StringIndexer(inputCol="Census_PrimaryDiskTypeName", outputCol="Census_PrimaryDiskTypeNameIndex")
data = indexer.fit(data).transform(data)

## Census_ChassisTypeName
Frequency encoding.


In [None]:
frequency_census = data.groupBy('Census_ChassisTypeName').count().withColumnRenamed('count','Census_ChassisTypeName_freq')
data = data.join(frequency_census,'Census_ChassisTypeName','left')
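
As a rough illustration of what this frequency encoding computes, the following plain-Python sketch (no Spark involved) replaces each category with the number of rows that share it. The sample values and the helper name `frequency_encode` are made up for the example.

```python
from collections import Counter

def frequency_encode(values):
    """Return, for each value, how many times it occurs in `values`."""
    counts = Counter(values)
    return [counts[v] for v in values]

# Hypothetical sample, not taken from the dataset
sample = ["Notebook", "Desktop", "Notebook", "Laptop", "Notebook", "Desktop"]
encoded = frequency_encode(sample)
print(encoded)  # [3, 2, 3, 1, 3, 2]
```

Categories with equal counts become indistinguishable after this encoding, which is a known trade-off of the technique.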

## Census_PowerPlatformRoleName

Label encoding for Census_PowerPlatformRoleName.

In [13]:
data.groupBy("Census_PowerPlatformRoleName").count().show(5000,False)

+----------------------------+-------+
|Census_PowerPlatformRoleName|count  |
+----------------------------+-------+
|Unspecified                 |5      |
|UNKNOWN                     |20683  |
|SOHOServer                  |37841  |
|AppliancePC                 |4015   |
|Workstation                 |109683 |
|Slate                       |492537 |
|Mobile                      |6182908|
|EnterpriseServer            |7094   |
|Desktop                     |2066620|
|PerformanceServer           |97     |
+----------------------------+-------+



In [12]:
data = data.fillna( { 'Census_PowerPlatformRoleName':'UNKNOWN'} )
indexer = StringIndexer(inputCol="Census_PowerPlatformRoleName", outputCol="Census_PowerPlatformRoleNameIndex")
data = indexer.fit(data).transform(data)

## Census_InternalBatteryType
Frequency encoding plus a boolean indicator.

In [19]:
data.groupBy("Census_InternalBatteryType").count().orderBy('count',ascending=False).show(100,False)

+--------------------------+-------+
|Census_InternalBatteryType|count  |
+--------------------------+-------+
|null                      |6338414|
|lion                      |2028256|
|li-i                      |245617 |
|#                         |183998 |
|lip                       |62099  |
|liio                      |32635  |
|li p                      |8383   |
|li                        |6708   |
|nimh                      |4614   |
|real                      |2744   |
|bq20                      |2302   |
|pbac                      |2274   |
|vbox                      |1454   |
|unkn                      |533    |
|lgi0                      |399    |
|lipo                      |198    |
|lhp0                      |182    |
|4cel                      |170    |
|lipp                      |83     |
|ithi                      |79     |
|batt                      |60     |
|ram                       |35     |
|virt                      |33     |
|bad                       |33     |

### Frequency

In [25]:
frequency_census = data.groupBy('Census_InternalBatteryType').count().withColumnRenamed('count','Census_InternalBatteryType_freq')
data = data.join(frequency_census,'Census_InternalBatteryType','left')  # an equi-join never matches null keys, so rows with a null battery type get a null frequency

### Boolean indicator

In [26]:
data = data.withColumn('Census_InternalBatteryType_informed',when(col('Census_InternalBatteryType').isNotNull(),1).otherwise(0))

## Census_OSVersion

In [28]:
#data.groupBy("Census_OSVersion").count().orderBy('count',ascending=False).show(100,False)

data.groupBy("Census_OSVersion").count().count()

469

In [32]:
version_parts = split(col('Census_OSVersion'), r'\.')  # raw string: the pattern is a regex, so the dot must be escaped
data = data.withColumn('Census_OSVersion_0', version_parts[0])\
        .withColumn('Census_OSVersion_1', version_parts[1])\
        .withColumn('Census_OSVersion_2', version_parts[2].cast(IntegerType()))\
        .withColumn('Census_OSVersion_3', version_parts[3].cast(IntegerType()))
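
The split can be pictured on a single value with plain Python. The helper `parse_os_version` below is hypothetical and assumes the version always has four dot-separated parts, as in the sample rows this notebook prints.

```python
def parse_os_version(version):
    """Split 'major.minor.build.revision' into its four components."""
    major, minor, build, revision = version.split(".")
    # The first two parts stay as strings, the last two are cast to int,
    # mirroring the column types derived in the Spark cell.
    return major, minor, int(build), int(revision)

print(parse_os_version("10.0.17134.228"))  # ('10', '0', 17134, 228)
```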

In [35]:
OSVersion = data.select('Census_OSVersion','Census_OSVersion_0','Census_OSVersion_1','Census_OSVersion_2','Census_OSVersion_3')

OSVersion.show()

+----------------+------------------+------------------+------------------+------------------+
|Census_OSVersion|Census_OSVersion_0|Census_OSVersion_1|Census_OSVersion_2|Census_OSVersion_3|
+----------------+------------------+------------------+------------------+------------------+
|  10.0.17134.228|                10|                 0|             17134|               228|
| 10.0.14393.1198|                10|                 0|             14393|              1198|
|  10.0.17134.165|                10|                 0|             17134|               165|
|  10.0.17134.112|                10|                 0|             17134|               112|
|  10.0.17134.165|                10|                 0|             17134|               165|
|  10.0.16299.125|                10|                 0|             16299|               125|
|  10.0.17134.165|                10|                 0|             17134|               165|
|  10.0.17134.228|                10|             

In [46]:
OSVersion.persist()

DataFrame[Census_OSVersion: string, Census_OSVersion_0: string, Census_OSVersion_1: string, Census_OSVersion_2: int, Census_OSVersion_3: int]

In [40]:
OSVersion.groupBy('Census_OSVersion_0').count().show(100000,False)

+------------------+-------+
|Census_OSVersion_0|count  |
+------------------+-------+
|6                 |20     |
|10                |8921463|
+------------------+-------+



In [41]:
OSVersion.groupBy('Census_OSVersion_1').count().show(100000,False)

+------------------+-------+
|Census_OSVersion_1|count  |
+------------------+-------+
|3                 |11     |
|0                 |8921463|
|1                 |5      |
|2                 |4      |
+------------------+-------+



In [49]:
OSVersion.groupBy('Census_OSVersion_2').count().orderBy('count',ascending=False).show(100000,False)

+------------------+-------+
|Census_OSVersion_2|count  |
+------------------+-------+
|17134             |4008881|
|16299             |2443249|
|15063             |797049 |
|14393             |785450 |
|10586             |593527 |
|10240             |271604 |
|17692             |3096   |
|17738             |3062   |
|17744             |2372   |
|17758             |1703   |
|17746             |1220   |
|17754             |1086   |
|17763             |1063   |
|17751             |1006   |
|17735             |980    |
|17741             |814    |
|17755             |684    |
|17760             |590    |
|17686             |556    |
|17733             |524    |
|17672             |351    |
|17677             |304    |
|17133             |253    |
|17682             |248    |
|18234             |233    |
|17666             |203    |
|18237             |173    |
|18242             |142    |
|17713             |126    |
|17661             |122    |
|17650             |61     |
|17639        

In [45]:
OSVersion.groupBy('Census_OSVersion_2').count().count()

165

In [50]:
OSVersion.groupBy('Census_OSVersion_3').count().orderBy('count',ascending=False).show(100000,False)

+------------------+-------+
|Census_OSVersion_3|count  |
+------------------+-------+
|228               |1413633|
|165               |899712 |
|431               |546546 |
|285               |470280 |
|547               |346853 |
|112               |346488 |
|371               |325267 |
|191               |228256 |
|2189              |223775 |
|611               |216776 |
|125               |213342 |
|17443             |206843 |
|1176              |182087 |
|492               |168878 |
|0                 |166369 |
|309               |151196 |
|286               |139040 |
|15                |117555 |
|254               |112344 |
|1                 |106585 |
|1206              |102275 |
|1266              |101237 |
|192               |99068  |
|167               |86787  |
|248               |77476  |
|137               |75873  |
|48                |66266  |
|1088              |63274  |
|81                |55384  |
|693               |50955  |
|1155              |46052  |
|164          

In [44]:
OSVersion.groupBy('Census_OSVersion_3').count().count()

285

In [47]:
OSVersion.groupBy('Census_OSVersion_3').count().describe().show()

+-------+------------------+-----------------+
|summary|Census_OSVersion_3|            count|
+-------+------------------+-----------------+
|  count|               285|              285|
|   mean| 5200.571929824561|31303.44912280702|
| stddev| 7716.680269813776|117182.5170191589|
|    min|                 0|                1|
|    max|             41736|          1413633|
+-------+------------------+-----------------+



## Census_OSArchitecture 
A clear-cut case for label encoding.

In [53]:
data.groupBy("Census_OSArchitecture").count().show(10,False)

+---------------------+-------+
|Census_OSArchitecture|count  |
+---------------------+-------+
|x86                  |815252 |
|arm64                |346    |
|amd64                |8105885|
+---------------------+-------+



In [None]:
indexer = StringIndexer(inputCol="Census_OSArchitecture", outputCol="Census_OSArchitectureIndex")
data = indexer.fit(data).transform(data)
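
With its default settings, StringIndexer orders labels by descending frequency, so the most common label receives index 0. The plain-Python sketch below mimics that ordering for the counts printed for Census_OSArchitecture. The helper `label_indices` is hypothetical, and it does not attempt to reproduce how Spark breaks ties between equally frequent labels.

```python
def label_indices(counts):
    """Map each label to its rank when labels are sorted by descending count."""
    ordered = sorted(counts, key=counts.get, reverse=True)
    # StringIndexer emits the index as a double, hence the float
    return {label: float(i) for i, label in enumerate(ordered)}

# Counts taken from the Census_OSArchitecture table in this notebook
architecture_counts = {"x86": 815252, "arm64": 346, "amd64": 8105885}
print(label_indices(architecture_counts))
# {'amd64': 0.0, 'x86': 1.0, 'arm64': 2.0}
```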


## Census_OSBranch

Frequency encoding, although this variable should really be grouped into broader categories.

In [4]:
data.groupBy("Census_OSBranch").count().orderBy('count',ascending=False).show(40,False)

+-------------------------+-------+
|Census_OSBranch          |count  |
+-------------------------+-------+
|rs4_release              |4009158|
|rs3_release              |1237321|
|rs3_release_svc_escrow   |1199767|
|rs2_release              |797066 |
|rs1_release              |785534 |
|th2_release              |326655 |
|th2_release_sec          |266882 |
|th1_st1                  |195840 |
|th1                      |75764  |
|rs5_release              |15324  |
|rs3_release_svc_escrow_im|6181   |
|rs_prerelease            |3171   |
|rs_prerelease_flt        |2714   |
|rs5_release_sigma        |62     |
|rs1_release_srvmedia     |10     |
|winblue_ltsb_escrow      |8      |
|win7sp1_ldr              |3      |
|winblue_ltsb             |3      |
|win8_gdr                 |3      |
|win7sp1_ldr_escrow       |2      |
|rs5_release_sigma_dev    |2      |
|rs5_release_edge         |2      |
|rs_xbox                  |2      |
|Khmer OS                 |1      |
|rs1_release_svc          |1

In [None]:
frequency_census = data.groupBy('Census_OSBranch').count().withColumnRenamed('count','Census_OSBranch_freq')
data = data.join(frequency_census,'Census_OSBranch','left')
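
One possible way to group the branches is to keep only the release family, that is, the text before the first underscore. This is a sketch of one option and not a method this notebook applies; the helper `branch_family` is hypothetical.

```python
def branch_family(branch):
    """Return the part of an OS branch name before the first underscore."""
    return branch.split("_", 1)[0]

# Branch names taken from the Census_OSBranch table in this notebook
names = ["rs4_release", "rs3_release_svc_escrow", "th2_release_sec", "th1", "win7sp1_ldr"]
families = [branch_family(name) for name in names]
print(families)  # ['rs4', 'rs3', 'th2', 'th1', 'win7sp1']
```

In Spark the same family could be derived with `split(col('Census_OSBranch'), '_')[0]`, after which the frequency or label encoding used elsewhere in this notebook would apply to far fewer categories.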

## Census_OSEdition

## Census_OSSkuName

## Census_OSInstallTypeName

## Census_OSWUAutoUpdateOptionsName

## Census_GenuineStateName

## Census_ActivationChannel

## Census_FlightRing