# Cloud Mode
* This notebook extracts features from a dataset made up of 131 folders of images of different fruits, 22688 images in total.
* Feature extraction is performed with the MobileNetV2 neural network.
* The features are stored in Parquet format.
* A PCA is then performed to reduce the number of dimensions.

### 4.10.2 Package installation

The required packages were installed during the **bootstrap** step, when the server was instantiated.

### 4.10.3 Importing the libraries

In [1]:
# Running this cell starts the Spark application

In [1]:
import pandas as pd
import numpy as np
import io
import os
import tensorflow as tf
from PIL import Image
from tensorflow.keras.applications.mobilenet_v2 import MobileNetV2, preprocess_input
from tensorflow.keras.preprocessing.image import img_to_array
from tensorflow.keras import Model
from pyspark.sql.functions import col, pandas_udf, PandasUDFType, element_at, split

Starting Spark application


ID,YARN Application ID,Kind,State,Spark UI,Driver log,Current session?
1,application_1685522011997_0003,pyspark,idle,Link,Link,✔


SparkSession available as 'spark'.


In [13]:
from pyspark.ml.feature import VectorAssembler, StandardScaler, PCA

<u>Displaying information about the current session and links to the Spark UI</u>:

In [2]:
%%info

ID,YARN Application ID,Kind,State,Spark UI,Driver log,Current session?
11,application_1685352775435_0013,pyspark,dead,Link,Link,
12,application_1685352775435_0014,pyspark,idle,Link,Link,✔


### 4.10.4 Defining the PATHs used to load the images and save the results

We access our **data on S3** directly, as if it were **stored locally**.

In [1]:
PATH = 's3://sbt-calculsdistribues1'
PATH_Data = PATH+'/Test'
PATH_Result = PATH+'/Results-all-2'
print(f'PATH:        {PATH}\n'
      f'PATH_Data:   {PATH_Data}\n'
      f'PATH_Result: {PATH_Result}')

Starting Spark application


ID,YARN Application ID,Kind,State,Spark UI,Driver log,User,Current session?
1,application_1708502544573_0002,pyspark,idle,Link,Link,,✔


SparkSession available as 'spark'.


PATH:        s3://sbt-calculsdistribues1
PATH_Data:   s3://sbt-calculsdistribues1/Test
PATH_Result: s3://sbt-calculsdistribues1/Results-all-2

### 4.10.5 Data processing

#### 4.10.5.1 Loading the data

In [4]:
images = spark.read.format("binaryFile") \
  .option("pathGlobFilter", "*.jpg") \
  .option("recursiveFileLookup", "true") \
  .load(PATH_Data)

<u>I add a column containing the **label** of each image (the name of its parent folder) <br />
    and display only the image **path** and its label</u>:

In [7]:
images = images.withColumn('label', element_at(split(images['path'], '/'), -2))
# printSchema() and show() print their output themselves and return None,
# so they are called directly instead of being wrapped in print().
images.printSchema()
images.select('path', 'label').show(5, False)

root
 |-- path: string (nullable = true)
 |-- modificationTime: timestamp (nullable = true)
 |-- length: long (nullable = true)
 |-- content: binary (nullable = true)
 |-- label: string (nullable = true)

+--------------------------------------------------------+----------+
|path                                                    |label     |
+--------------------------------------------------------+----------+
|s3://sbt-calculsdistribues/Test/Watermelon/r_106_100.jpg|Watermelon|
|s3://sbt-calculsdistribues/Test/Watermelon/r_109_100.jpg|Watermelon|
|s3://sbt-calculsdistribues/Test/Watermelon/r_108_100.jpg|Watermelon|
|s3://sbt-calculsdistribues/Test/Watermelon/r_107_100.jpg|Watermelon|
|s3://sbt-calculsdistribues/Test/Watermelon/r_95_100.jpg |Watermelon|
+--------------------------------------------------------+----------+
only showing top 5 rows
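
As an illustration of how the label column is derived, the plain-Python sketch below mirrors what `element_at(split(path, '/'), -2)` computes for a single path: the string is split on `/` and the second-to-last element, which is the folder name, is kept. The helper name `label_from_path` is hypothetical and is not part of the notebook.

```python
def label_from_path(path: str) -> str:
    """Return the parent folder name of an image path (used as the label)."""
    # Same indexing as element_at(split(path, '/'), -2):
    # the last element is the file name, the one before it is the folder.
    return path.split('/')[-2]


# Path taken from the output table above.
print(label_from_path('s3://sbt-calculsdistribues/Test/Watermelon/r_106_100.jpg'))
```

Folder names that contain spaces, such as `Pineapple Mini`, are returned unchanged because only `/` is used as a separator.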

#### 4.10.5.2 Preparing the model

In [8]:
model = MobileNetV2(weights='imagenet',
                    include_top=True,
                    input_shape=(224, 224, 3))

Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/mobilenet_v2/mobilenet_v2_weights_tf_dim_ordering_tf_kernels_1.0_224.h5

In [9]:
new_model = Model(inputs=model.input,
                  outputs=model.layers[-2].output)

In [10]:
brodcast_weights = sc.broadcast(new_model.get_weights())

In [11]:
new_model.summary()

Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            [(None, 224, 224, 3) 0                                            
__________________________________________________________________________________________________
Conv1 (Conv2D)                  (None, 112, 112, 32) 864         input_1[0][0]                    
__________________________________________________________________________________________________
bn_Conv1 (BatchNormalization)   (None, 112, 112, 32) 128         Conv1[0][0]                      
__________________________________________________________________________________________________
Conv1_relu (ReLU)               (None, 112, 112, 32) 0           bn_Conv1[0][0]                   
______________________________________________________________________________________________

In [12]:
def model_fn():
    """
    Returns a MobileNetV2 model with top layer removed 
    and broadcasted pretrained weights.
    """
    model = MobileNetV2(weights='imagenet',
                        include_top=True,
                        input_shape=(224, 224, 3))
    for layer in model.layers:
        layer.trainable = False
    new_model = Model(inputs=model.input,
                      outputs=model.layers[-2].output)
    new_model.set_weights(brodcast_weights.value)
    return new_model

#### 4.10.5.3 Defining the image loading process <br/> and applying featurization with a pandas UDF

In [13]:
def preprocess(content):
    """
    Preprocesses raw image bytes for prediction.
    """
    img = Image.open(io.BytesIO(content)).resize([224, 224])
    arr = img_to_array(img)
    return preprocess_input(arr)

def featurize_series(model, content_series):
    """
    Featurize a pd.Series of raw images using the input model.
    :return: a pd.Series of image features
    """
    batch = np.stack(content_series.map(preprocess))
    preds = model.predict(batch)
    # For some layers, output features will be multi-dimensional tensors.
    # We flatten the feature tensors to vectors for easier storage in Spark DataFrames.
    output = [p.flatten() for p in preds]
    return pd.Series(output)

@pandas_udf('array<float>', PandasUDFType.SCALAR_ITER)
def featurize_udf(content_series_iter):
    '''
    This method is a Scalar Iterator pandas UDF wrapping our featurization function.
    The decorator specifies that this returns a Spark DataFrame column of type ArrayType(FloatType).

    :param content_series_iter: This argument is an iterator over batches of data, where each batch
                              is a pandas Series of image data.
    '''
    # With Scalar Iterator pandas UDFs, we can load the model once and then re-use it
    # for multiple data batches.  This amortizes the overhead of loading big models.
    model = model_fn()
    for content_series in content_series_iter:
        yield featurize_series(model, content_series)
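
The point of the Scalar Iterator variant is that the model is built once per iterator rather than once per batch. The sketch below reproduces only that control flow in plain Python, without Spark or TensorFlow; `load_model`, `featurize_batches` and the doubling "model" are stand-ins invented for the illustration.

```python
from typing import Callable, Iterable, Iterator, List

load_count = 0  # counts how many times the stand-in model is built


def load_model() -> Callable[[List[float]], List[float]]:
    """Stand-in for model_fn(): pretend that building the model is expensive."""
    global load_count
    load_count += 1
    return lambda batch: [2.0 * x for x in batch]  # dummy "prediction"


def featurize_batches(batches: Iterable[List[float]]) -> Iterator[List[float]]:
    """Same shape as featurize_udf: one model, many batches."""
    model = load_model()  # built once per call, not once per batch
    for batch in batches:
        yield model(batch)


results = list(featurize_batches([[1.0, 2.0], [3.0], [4.0, 5.0]]))
print(results)     # one output batch per input batch
print(load_count)  # number of times the model was built
```

A plain scalar pandas UDF would instead correspond to calling `load_model()` inside the loop, which is the overhead this pattern avoids.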


#### 4.10.5.4 Running the feature extraction

In [14]:
# spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", "1024")

In [15]:
features_df = images.repartition(24).select(col("path"),
                                            col("label"),
                                            featurize_udf("content").alias("features")
                                           )
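
To get a feel for the workload of each task, the arithmetic below divides the dataset size by the number of partitions. It assumes that `PATH_Data` holds the 22688 images mentioned in the introduction and that `repartition(24)` spreads the rows roughly evenly, which is what its round-robin distribution normally gives.

```python
n_images = 22688    # total stated in the introduction (assumed to match PATH_Data)
n_partitions = 24   # value passed to repartition()

# base: rows every partition gets; extra: partitions that get one more row
base, extra = divmod(n_images, n_partitions)
print(base, extra)
```

Under these assumptions each task featurizes a little under a thousand images, so the cost of building the model once per task is spread over many batches.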

In [16]:
print(PATH_Result)

s3://sbt-calculsdistribues/Results-all-2

In [17]:
features_df.write.mode("overwrite").parquet(PATH_Result)

### 4.10.6 Loading the saved data and validating the result

In [19]:
df_spark = spark.read.parquet(PATH_Result)

In [5]:
df_spark.features

Column<'features'>

In [20]:
df = df_spark.toPandas()

In [19]:
df.head()

                                                path  ...                                           features
0  s3://sbt-calculsdistribues/Test/Watermelon/r_1...  ...  [0.0, 0.9346336722373962, 0.14799268543720245,...
1  s3://sbt-calculsdistribues/Test/Watermelon/r_6...  ...  [1.3194596767425537, 0.2760419249534607, 0.0, ...
2  s3://sbt-calculsdistribues/Test/Watermelon/r_8...  ...  [0.5296130180358887, 0.09730405360460281, 0.0,...
3  s3://sbt-calculsdistribues/Test/Pineapple Mini...  ...  [0.0, 4.512625694274902, 0.0, 0.0, 0.0, 0.0, 0...
4  s3://sbt-calculsdistribues/Test/Pineapple Mini...  ...  [0.007994337007403374, 4.551527500152588, 0.0,...

[5 rows x 3 columns]

In [21]:
from pyspark.ml.linalg import Vectors

# Converting to the right format to pass a Spark DataFrame to the PySpark PCA function

In [22]:
array_list = df.features.to_list()

In [23]:
vectors_list = [ Vectors.dense(array) for array in array_list]

In [24]:
df2 = spark.createDataFrame([(vector,) for vector in vectors_list],["features"])
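
The conversion above goes through pandas and Python lists on the driver, and the resulting `df2` no longer carries `path` and `label`. As a possible alternative, assuming Spark 3.1 or later (where `pyspark.ml.functions.array_to_vector` is available), the array column can be converted on the executors instead. The helper name `add_vector_column` and the column name `features_vec` are hypothetical.

```python
def add_vector_column(df, array_col="features", vector_col="features_vec"):
    """Return df with an extra ML vector column built from an array column."""
    # Imported inside the function so the sketch stays self-contained.
    from pyspark.sql import functions as F
    from pyspark.ml.functions import array_to_vector  # assumes Spark >= 3.1

    # ML vectors store doubles, so the array<float> column is cast first.
    as_double = F.col(array_col).cast("array<double>")
    return df.withColumn(vector_col, array_to_vector(as_double))
```

With this helper, `add_vector_column(df_spark)` should be usable as the input of `PCA(..., inputCol='features_vec')` without calling `toPandas()`, while keeping the `path` and `label` columns next to the features.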

In [15]:
df_s = df2.sample(fraction=0.05)

In [29]:
df_spark.head()

Row(path='s3://sbt-calculsdistribues/Test/Watermelon/r_110_100.jpg', label='Watermelon', features=[0.0, 0.9346336722373962, 0.14799268543720245, 0.0, 1.2669360637664795, 0.0, 0.378207266330719, 0.33445054292678833, 0.0, 0.0, 1.2442141771316528, 0.30819788575172424, 0.05890428274869919, 0.020689580589532852, 0.16925126314163208, 0.56295245885849, 0.0, 0.0, 0.07362665235996246, 0.22331689298152924, 0.0, 0.0, 0.0, 0.057339806109666824, 0.007363423239439726, 0.9852887988090515, 1.3186299800872803, 0.0, 0.0, 2.206303596496582, 0.0, 0.0, 0.3238607347011566, 0.23553502559661865, 0.0, 0.36168450117111206, 0.100311279296875, 1.3656412363052368, 0.06170520931482315, 0.0, 0.006947480142116547, 0.0, 1.1449416875839233, 0.049447644501924515, 0.3294565975666046, 0.0, 0.187885582447052, 0.1520332545042038, 1.7633845806121826, 0.09460563212633133, 0.0, 0.013475589454174042, 0.31552180647850037, 0.0, 0.4473423361778259, 1.9310648441314697, 0.0, 1.5819429159164429, 0.0, 0.12968990206718445, 0.2610733509

# The driver memory must be increased to avoid memory errors

In [11]:
%%configure -f 
{"driverMemory": "6000M"}

Starting Spark application


ID,YARN Application ID,Kind,State,Spark UI,Driver log,Current session?
2,application_1685522011997_0004,pyspark,idle,Link,Link,✔


SparkSession available as 'spark'.
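
A rough order-of-magnitude estimate helps explain why the default driver memory can be too small: `toPandas()` and `createDataFrame()` both move every feature vector through the driver. The figures below assume the 22688 images from the introduction and 1280 features per image (the width of MobileNetV2's global average pooling layer, which is `model.layers[-2]` here). They count raw values only and ignore Python object overhead and intermediate copies, which make the real footprint noticeably larger.

```python
n_images = 22688     # total stated in the introduction
n_features = 1280    # assumed: output width of MobileNetV2's pooling layer

n_values = n_images * n_features
mib = 1024 ** 2

size_float32_mib = round(n_values * 4 / mib)  # 4 bytes per value
size_float64_mib = round(n_values * 8 / mib)  # 8 bytes per value (ML vectors use doubles)

print(n_values, size_float32_mib, size_float64_mib)
```

Since the pandas DataFrame, the Python list and the list of `DenseVector` objects coexist on the driver, several multiples of these sizes are plausibly needed at the same time.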



# Features extracted from the images by the MobileNetV2 network

In [31]:
df2.show()

+--------------------+
|            features|
+--------------------+
|[0.0,0.9346336722...|
|[1.31945967674255...|
|[0.52961301803588...|
|[0.0,4.5126256942...|
|[0.00799433700740...|
|[0.04449297487735...|
|[0.00461275922134...|
|[0.0,0.2693336904...|
|[0.0,1.0986049175...|
|[0.0,0.4306903779...|
|[0.01588310301303...|
|[0.09707429260015...|
|[1.65572965145111...|
|[0.02456750161945...|
|[0.06463862210512...|
|[0.04984579607844...|
|[0.0,0.0465781427...|
|[0.0,0.0279957950...|
|[0.0,0.0059984656...|
|[0.0,3.9508583545...|
+--------------------+
only showing top 20 rows

# First PCA with 10 components to check that everything works

In [25]:
#start = time.time()
n_components = 10
pca = PCA(
    k = n_components, 
    inputCol = 'features', 
    outputCol = 'pcaFeatures'
).fit(df2)
df3 = pca.transform(df2)

#end = time.time()
#print(end - start)

# DataFrame returned by the PCA

In [30]:
df3.show()

+--------------------+--------------------+
|            features|         pcaFeatures|
+--------------------+--------------------+
|[0.0,0.9346336722...|[-2.7870339018409...|
|[1.31945967674255...|[-2.5390918716231...|
|[0.52961301803588...|[-4.6429835607350...|
|[0.0,4.5126256942...|[-5.8523372512992...|
|[0.00799433700740...|[-6.0254677295371...|
|[0.04449297487735...|[-3.4015426145020...|
|[0.00461275922134...|[-2.2765656132952...|
|[0.0,0.2693336904...|[0.73589999249634...|
|[0.0,1.0986049175...|[-4.1329483208380...|
|[0.0,0.4306903779...|[-4.4896372468841...|
|[0.01588310301303...|[-0.4455515374955...|
|[0.09707429260015...|[-1.0467383878491...|
|[1.65572965145111...|[0.58033100507965...|
|[0.02456750161945...|[-4.8309496801690...|
|[0.06463862210512...|[-0.1297775569271...|
|[0.04984579607844...|[-7.1077086590428...|
|[0.0,0.0465781427...|[-2.8010905413473...|
|[0.0,0.0279957950...|[-3.1137810070018...|
|[0.0,0.0059984656...|[-3.0946362447205...|
|[0.0,3.9508583545...|[-3.126799

In [27]:
df4 = df3.toPandas()

In [28]:
df4.head()

                                            features                                        pcaFeatures
0  [0.0, 0.9346336722373962, 0.14799268543720245,...  [-2.787033901840964, 5.261695069807606, -6.107...
1  [1.3194596767425537, 0.2760419249534607, 0.0, ...  [-2.5390918716231896, 5.514261954816636, -6.15...
2  [0.5296130180358887, 0.09730405360460281, 0.0,...  [-4.642983560735027, 7.486536010933872, -5.899...
3  [0.0, 4.512625694274902, 0.0, 0.0, 0.0, 0.0, 0...  [-5.852337251299286, 4.030082367206919, 0.9545...
4  [0.007994337007403374, 4.551527500152588, 0.0,...  [-6.025467729537131, 3.681532899600041, 0.6934...

# PCA with 750 components

In [32]:
#start = time.time()
n_components = 750
pca = PCA(
    k = n_components, 
    inputCol = 'features', 
    outputCol = 'pcaFeatures'
).fit(df2)
df750 = pca.transform(df2)

#end = time.time()
#print(end - start)

In [34]:
pca.explainedVariance

DenseVector([0.1014, 0.0801, 0.0635, 0.0501, 0.0354, 0.0292, 0.0277, 0.0228, 0.0199, 0.0191, 0.0165, 0.0147, 0.014, 0.0137, 0.0133, 0.0125, 0.0116, 0.0107, 0.0098, 0.0097, 0.0092, 0.0083, 0.0079, 0.0075, 0.0071, 0.0071, 0.0068, 0.0062, 0.0061, 0.0059, 0.0057, 0.0056, 0.0053, 0.0051, 0.0048, 0.0047, 0.0046, 0.0043, 0.0042, 0.0041, 0.0039, 0.0039, 0.0038, 0.0037, 0.0036, 0.0034, 0.0034, 0.0033, 0.0033, 0.0032, 0.0032, 0.0031, 0.003, 0.0029, 0.0028, 0.0028, 0.0028, 0.0027, 0.0026, 0.0025, 0.0025, 0.0025, 0.0024, 0.0023, 0.0023, 0.0022, 0.0022, 0.0022, 0.0021, 0.0021, 0.0021, 0.002, 0.0019, 0.0019, 0.0019, 0.0018, 0.0018, 0.0018, 0.0017, 0.0017, 0.0017, 0.0017, 0.0016, 0.0016, 0.0016, 0.0016, 0.0016, 0.0015, 0.0015, 0.0015, 0.0014, 0.0014, 0.0014, 0.0014, 0.0014, 0.0014, 0.0014, 0.0013, 0.0013, 0.0013, 0.0013, 0.0012, 0.0012, 0.0012, 0.0012, 0.0012, 0.0012, 0.0012, 0.0011, 0.0011, 0.0011, 0.0011, 0.0011, 0.0011, 0.0011, 0.001, 0.001, 0.001, 0.001, 0.001, 0.001, 0.001, 0.0009, 0.0009, 0.000

In [37]:
np.cumsum(pca.explainedVariance)

array([0.10140993, 0.18146972, 0.24497274, 0.29512059, 0.33047488,
       0.3596348 , 0.38736694, 0.41021583, 0.43007527, 0.4491668 ,
       0.46570387, 0.48036675, 0.49434859, 0.50803436, 0.52138226,
       0.5338826 , 0.54543343, 0.55618264, 0.56598301, 0.57569308,
       0.58485138, 0.59313437, 0.60102209, 0.60852993, 0.61567809,
       0.62275045, 0.62958686, 0.63580522, 0.64192727, 0.64781775,
       0.6535209 , 0.65909672, 0.66437057, 0.66942059, 0.67420583,
       0.67889421, 0.68348948, 0.68783587, 0.69207105, 0.69616312,
       0.70006549, 0.70395477, 0.70779923, 0.71152293, 0.71515284,
       0.71857504, 0.72197207, 0.72530281, 0.72859337, 0.73183042,
       0.73502977, 0.73808652, 0.7410565 , 0.74393956, 0.74676961,
       0.7495623 , 0.75231607, 0.75500415, 0.75760227, 0.76012722,
       0.76260926, 0.76506099, 0.76741253, 0.76973087, 0.77202279,
       0.77426796, 0.77646267, 0.77863137, 0.7807576 , 0.78287009,
       0.7849326 , 0.78692533, 0.78885697, 0.79076406, 0.79263
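
The cumulative sum is mainly useful for choosing how many components to keep. Below is a minimal plain-Python sketch of picking the smallest number of components that reaches a target share of variance. `components_for` is a hypothetical helper, and the five ratios are the first values of the `explainedVariance` output shown earlier.

```python
from itertools import accumulate


def components_for(explained, threshold):
    """Smallest k whose cumulative explained variance reaches threshold, else None."""
    for k, total in enumerate(accumulate(explained), start=1):
        if total >= threshold:
            return k
    return None  # the threshold is not reached with the ratios provided


# First five ratios copied from the explainedVariance output.
ratios = [0.1014, 0.0801, 0.0635, 0.0501, 0.0354]
print(components_for(ratios, 0.30))  # reached within these five values
print(components_for(ratios, 0.90))  # not reached with only five values
```

In the notebook, the same helper could be applied to the full vector with `components_for(pca.explainedVariance.toArray(), threshold)`, the threshold being a choice left to the analyst.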

In [33]:
df750.show()

+--------------------+--------------------+
|            features|         pcaFeatures|
+--------------------+--------------------+
|[0.0,0.9346336722...|[-2.7870339018409...|
|[1.31945967674255...|[-2.5390918716231...|
|[0.52961301803588...|[-4.6429835607350...|
|[0.0,4.5126256942...|[-5.8523372512992...|
|[0.00799433700740...|[-6.0254677295371...|
|[0.04449297487735...|[-3.4015426145020...|
|[0.00461275922134...|[-2.2765656132952...|
|[0.0,0.2693336904...|[0.73589999249634...|
|[0.0,1.0986049175...|[-4.1329483208380...|
|[0.0,0.4306903779...|[-4.4896372468841...|
|[0.01588310301303...|[-0.4455515374955...|
|[0.09707429260015...|[-1.0467383878491...|
|[1.65572965145111...|[0.58033100507965...|
|[0.02456750161945...|[-4.8309496801690...|
|[0.06463862210512...|[-0.1297775569271...|
|[0.04984579607844...|[-7.1077086590428...|
|[0.0,0.0465781427...|[-2.8010905413473...|
|[0.0,0.0279957950...|[-3.1137810070018...|
|[0.0,0.0059984656...|[-3.0946362447205...|
|[0.0,3.9508583545...|[-3.126799

In [38]:
#start = time.time()
n_components = 1000
pca = PCA(
    k = n_components, 
    inputCol = 'features', 
    outputCol = 'pcaFeatures'
).fit(df2)
df1000 = pca.transform(df2)

#end = time.time()
#print(end - start)

In [40]:
df1000.show()

+--------------------+--------------------+
|            features|         pcaFeatures|
+--------------------+--------------------+
|[0.0,0.9346336722...|[-2.7870339018409...|
|[1.31945967674255...|[-2.5390918716231...|
|[0.52961301803588...|[-4.6429835607350...|
|[0.0,4.5126256942...|[-5.8523372512992...|
|[0.00799433700740...|[-6.0254677295371...|
|[0.04449297487735...|[-3.4015426145020...|
|[0.00461275922134...|[-2.2765656132952...|
|[0.0,0.2693336904...|[0.73589999249634...|
|[0.0,1.0986049175...|[-4.1329483208380...|
|[0.0,0.4306903779...|[-4.4896372468841...|
|[0.01588310301303...|[-0.4455515374955...|
|[0.09707429260015...|[-1.0467383878491...|
|[1.65572965145111...|[0.58033100507965...|
|[0.02456750161945...|[-4.8309496801690...|
|[0.06463862210512...|[-0.1297775569271...|
|[0.04984579607844...|[-7.1077086590428...|
|[0.0,0.0465781427...|[-2.8010905413473...|
|[0.0,0.0279957950...|[-3.1137810070018...|
|[0.0,0.0059984656...|[-3.0946362447205...|
|[0.0,3.9508583545...|[-3.126799

In [39]:
np.cumsum(pca.explainedVariance)

array([0.10140993, 0.18146972, 0.24497274, 0.29512059, 0.33047488,
       0.3596348 , 0.38736694, 0.41021583, 0.43007527, 0.4491668 ,
       0.46570387, 0.48036675, 0.49434859, 0.50803436, 0.52138226,
       0.5338826 , 0.54543343, 0.55618264, 0.56598301, 0.57569308,
       0.58485138, 0.59313437, 0.60102209, 0.60852993, 0.61567809,
       0.62275045, 0.62958686, 0.63580522, 0.64192727, 0.64781775,
       0.6535209 , 0.65909672, 0.66437057, 0.66942059, 0.67420583,
       0.67889421, 0.68348948, 0.68783587, 0.69207105, 0.69616312,
       0.70006549, 0.70395477, 0.70779923, 0.71152293, 0.71515284,
       0.71857504, 0.72197207, 0.72530281, 0.72859337, 0.73183042,
       0.73502977, 0.73808652, 0.7410565 , 0.74393956, 0.74676961,
       0.7495623 , 0.75231607, 0.75500415, 0.75760227, 0.76012722,
       0.76260926, 0.76506099, 0.76741253, 0.76973087, 0.77202279,
       0.77426796, 0.77646267, 0.77863137, 0.7807576 , 0.78287009,
       0.7849326 , 0.78692533, 0.78885697, 0.79076406, 0.79263

# Saving the features

In [48]:
df1000.write.mode("overwrite").parquet(PATH+"/pcaFeature1000.parquet")
