Let's start with your project: 

Are you a data scientist? 

I think you are an awesome data scientist.

### **Problem** 
**Our goal is to create a predictive model that can answer the following question:**

**What kind of people had a better chance of surviving?**

**Data about passengers:**
*   Name
*   Age
*   Gender.


## Install and Import Libraries
Let's locate the Spark installation with findspark and import PySpark:

In [1]:
import findspark
findspark.init()
import pyspark

## Build Spark Session

In [3]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
sp = spark.sparkContext

21/10/06 02:30:37 WARN Utils: Your hostname, ubuntu resolves to a loopback address: 127.0.1.1; using 192.168.130.131 instead (on interface ens33)
21/10/06 02:30:37 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
21/10/06 02:30:48 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


In [4]:
spark = SparkSession.builder.appName('practicallab').getOrCreate()

## Data Loading


You have two datasets:
* Train
* Test

Read both datasets:



In [5]:
train = spark.read.csv('train.csv', header=True, inferSchema=True)
test = spark.read.csv('test.csv', header=True, inferSchema=True)


In [6]:
# train.show()
# test.show()

Let's work with the train dataset:

**Confirm if this is a dataframe or not:**

In [7]:
train

DataFrame[PassengerId: int, Survived: int, Pclass: int, Name: string, Sex: string, Age: double, SibSp: int, Parch: int, Ticket: string, Fare: double, Cabin: string, Embarked: string]

**Show 5 rows.**

In [8]:
train.show(5)

+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
|PassengerId|Survived|Pclass|                Name|   Sex| Age|SibSp|Parch|          Ticket|   Fare|Cabin|Embarked|
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
|          1|       0|     3|Braund, Mr. Owen ...|  male|22.0|    1|    0|       A/5 21171|   7.25| null|       S|
|          2|       1|     1|Cumings, Mrs. Joh...|female|38.0|    1|    0|        PC 17599|71.2833|  C85|       C|
|          3|       1|     3|Heikkinen, Miss. ...|female|26.0|    0|    0|STON/O2. 3101282|  7.925| null|       S|
|          4|       1|     1|Futrelle, Mrs. Ja...|female|35.0|    1|    0|          113803|   53.1| C123|       S|
|          5|       0|     3|Allen, Mr. Willia...|  male|35.0|    0|    0|          373450|   8.05| null|       S|
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+

**Display schema for the dataset:**

In [9]:
train.printSchema()

root
 |-- PassengerId: integer (nullable = true)
 |-- Survived: integer (nullable = true)
 |-- Pclass: integer (nullable = true)
 |-- Name: string (nullable = true)
 |-- Sex: string (nullable = true)
 |-- Age: double (nullable = true)
 |-- SibSp: integer (nullable = true)
 |-- Parch: integer (nullable = true)
 |-- Ticket: string (nullable = true)
 |-- Fare: double (nullable = true)
 |-- Cabin: string (nullable = true)
 |-- Embarked: string (nullable = true)



**Statistical summary:**

In [10]:
summary=train.describe()
summary.show()


+-------+-----------------+-------------------+------------------+--------------------+------+------------------+------------------+-------------------+------------------+-----------------+-----+--------+
|summary|      PassengerId|           Survived|            Pclass|                Name|   Sex|               Age|             SibSp|              Parch|            Ticket|             Fare|Cabin|Embarked|
+-------+-----------------+-------------------+------------------+--------------------+------+------------------+------------------+-------------------+------------------+-----------------+-----+--------+
|  count|              891|                891|               891|                 891|   891|               714|               891|                891|               891|              891|  204|     889|
|   mean|            446.0| 0.3838383838383838| 2.308641975308642|                null|  null| 29.69911764705882|0.5230078563411896|0.38159371492704824|260318.54916792738| 32.20420

                                                                                

## EDA - Exploratory Data Analysis

**Display count for the train dataset:**

In [11]:
train.count()

891

**Can you answer this question:** 

**How many people survived, and how many didn't survive?** 

**Please save data in a variable.**

In [12]:
survived=train.select('Survived').where(train.Survived==1).count()
not_survived=train.count()-survived

**Display your result:**

In [13]:
print("Survived: {} , Didn't Survive: {}".format(survived,not_survived))

Survived: 342 , Didn't Survive: 549


**Can you display your answer in ratio form? (Hint: Write a user-defined function.)**






In [14]:
def ratio(survived, notsurv):
    # Return both shares of the total as percentages.
    total = survived + notsurv
    sur_ratio = survived / total
    not_sur_ratio = notsurv / total
    return sur_ratio * 100, not_sur_ratio * 100

In [15]:
# Use new names so the original counts are not overwritten by percentages.
survived_pct, not_survived_pct = ratio(survived, not_survived)
print("Ratio of Survived: {} % , Ratio of didn't survive: {} % ".format(survived_pct, not_survived_pct))

Ratio of Survived: 38.38383838383838 % , Ratio of didn't survive: 61.61616161616161 % 


**Can you get the number of males and females?**


In [16]:
male=train.select('Sex').where(train.Sex=='male').count()
female=train.select('Sex').where(train.Sex=='female').count()
print("Male: {} , Female: {}".format(male,female))

Male: 577 , Female: 314


**1. What is the average number of survivors of each gender?**

**2. What is the number of survivors of each gender?**

(Hint: Group by the "sex" column.)

In [17]:
male_surv=train.select('Sex').where((train.Sex=='male') & (train.Survived==1)).count()
female_surv=train.select('Sex').where((train.Sex=='female') & (train.Survived==1)).count()
print("Male Survived: {}, Female Survived: {}".format(male_surv,female_surv))

Male Survived: 109, Female Survived: 233


In [18]:
print("What is the average number of survivors of each gender?")
train.groupBy('Sex').agg({'Survived':'mean'}).show()

What is the average number of survivors of each gender?




+------+-------------------+
|   Sex|      avg(Survived)|
+------+-------------------+
|female| 0.7420382165605095|
|  male|0.18890814558058924|
+------+-------------------+
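Because "Survived" holds only 0 and 1, its mean per group is the share of survivors in that group. The plain-Python sketch below checks this against the counts printed earlier (314 females, 577 males, 233 and 109 survivors); the dictionary names are only illustrative.

```python
# Counts taken from the outputs shown earlier in this notebook.
passengers = {"female": 314, "male": 577}
survivors = {"female": 233, "male": 109}

# Averaging a 0/1 column equals survivors divided by passengers.
survival_rate = {sex: survivors[sex] / passengers[sex] for sex in passengers}

for sex, rate in survival_rate.items():
    print("{}: {:.4f}".format(sex, rate))
```

The two values should agree with the avg(Survived) column above.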





**Create a temporary view in PySpark:**

In [19]:
train.createOrReplaceTempView("train_tbl")

**How many people survived, and how many didn't survive? By SQL:**

In [20]:
spark.sql("""select count(Survived) as Survived from train_tbl where Survived ==1""").show()
spark.sql("""select count(Survived) as Not_survived from train_tbl where Survived !=1""").show()

+--------+
|Survived|
+--------+
|     342|
+--------+

+------------+
|Not_survived|
+------------+
|         549|
+------------+



**Can you display the number of survivors from each gender as a ratio?**

(Hint: Group by "sex" column.)

**Can you do this via SQL?**

In [21]:
spark.sql('select Sex, avg(Survived) as Survived_Ratio from train_tbl group by Sex').show()



+------+-------------------+
|   Sex|     Survived_Ratio|
+------+-------------------+
|female| 0.7420382165605095|
|  male|0.18890814558058924|
+------+-------------------+



**Display the ratio of passengers in each Pclass:**


In [22]:
spark.sql("""
    select Pclass,
           count(Pclass) / sum(count(*)) over () as Pclass_Ratio
    from train_tbl
    group by Pclass
""").show(truncate=False)

21/10/06 02:31:05 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.

+------+-------------------+
|Pclass|Pclass_Ratio       |
+------+-------------------+
|1     |0.24242424242424243|
|3     |0.5510662177328844 |
|2     |0.20650953984287318|
+------+-------------------+





**Let's take a break and continue after this.**

## Data Cleaning

**First, merge the train and test datasets into one. (Hint: A union can do this.)**



In [23]:
test.createOrReplaceTempView("test_tbl")
df=spark.sql("""select * from train_tbl union select * from test_tbl""")

**Display count:**

In [24]:
df.count()

                                                                                

1329
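In SQL, plain union removes duplicate rows, while union all keeps every row, so the count after a union can be smaller than the sum of the two tables' counts. The sketch below shows the difference using Python's built-in sqlite3 module as a stand-in for Spark SQL; the tables a and b and their rows are made up for illustration.

```python
import sqlite3

# In-memory database with two small, hypothetical tables.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE a (id INTEGER, name TEXT)")
conn.execute("CREATE TABLE b (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO a VALUES (?, ?)", [(1, "x"), (2, "y")])
conn.executemany("INSERT INTO b VALUES (?, ?)", [(2, "y"), (3, "z")])

# UNION drops the duplicated row (2, 'y'); UNION ALL keeps both copies.
union_rows = conn.execute("SELECT * FROM a UNION SELECT * FROM b").fetchall()
union_all_rows = conn.execute("SELECT * FROM a UNION ALL SELECT * FROM b").fetchall()

print(len(union_rows), len(union_all_rows))
conn.close()
```

In PySpark, the DataFrame method union keeps duplicates (like union all), so add distinct() if you want the SQL union behavior.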

**Create a temporary view in PySpark:**

In [25]:
df.createOrReplaceTempView("df_tbl")

**Can you find the number of null values in each column?**


In [26]:
from pyspark.sql.functions import col

def null_value_count(df):
    """Return (column, null count) pairs for columns that contain nulls."""
    null_columns_counts = []
    for k in df.columns:
        # Count the rows where this column is null.
        null_rows = df.where(col(k).isNull()).count()
        if null_rows > 0:
            null_columns_counts.append((k, null_rows))
    return null_columns_counts

In [27]:
null_columns_count_list = null_value_count(df)

                                                                                

In [28]:
null_columns_count_list

[('Age', 265), ('Cabin', 1021), ('Embarked', 3)]

**Create Dataframe for null values**

1. Column
2. Number of missing values.

In [29]:
df_null=df.select(df.Age,df.Embarked,df.Cabin)
df_null.show()

+----+--------+-----+
| Age|Embarked|Cabin|
+----+--------+-----+
|39.0|       S| null|
| 9.0|       S| null|
|40.0|       S|  B94|
|36.0|       C|    D|
|26.0|       S| null|
|null|       C| null|
|null|       S| C124|
|48.0|       S| null|
|22.0|       S| null|
|38.0|       S| null|
|45.0|       S| null|
|36.0|       S|  E25|
|60.0|       S| null|
|null|       S| null|
|30.0|       S| null|
|null|       S| null|
|22.0|       S| null|
|null|       Q| null|
|29.0|       S|   B5|
| 2.0|       Q| null|
+----+--------+-----+
only showing top 20 rows



## Preprocessing 

**Can you show me the name column from your temporary table?**

In [30]:
spark.sql('select Name from df_tbl').show()

+--------------------+
|                Name|
+--------------------+
|Andersson, Mr. An...|
|"Goldsmith, Maste...|
|Harrison, Mr. Wil...|
|Levy, Mr. Rene Ja...|
|Nilsson, Miss. He...|
|   Yousif, Mr. Wazli|
|  Klaber, Mr. Herman|
|Herman, Mrs. Samu...|
|Dahlberg, Miss. G...|
|Asplund, Mrs. Car...|
|"Romaine, Mr. Cha...|
|"Flynn, Mr. John ...|
|Brown, Mr. Thomas...|
|        Lam, Mr. Ali|
|Somerton, Mr. Fra...|
|Risien, Mr. Samue...|
|  Ohman, Miss. Velin|
|"O'Leary, Miss. H...|
|Allen, Miss. Elis...|
|Rice, Master. Eugene|
+--------------------+
only showing top 20 rows



**Run this code:**

In [31]:
from pyspark.sql import functions as F
# Raw string so the regex escape "\." is not treated as a Python escape sequence.
combined = df.withColumn('Title', F.regexp_extract(F.col("Name"), r"([A-Za-z]+)\.", 1))
combined.createOrReplaceTempView('combined')
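To see what the pattern captures, here is a small plain-Python sketch that applies the same regular expression with the re module to a few names from the table above. The helper name extract_title is hypothetical; like regexp_extract, it returns an empty string when nothing matches.

```python
import re

# Same pattern as in the Spark cell: letters immediately followed by a period.
TITLE_PATTERN = re.compile(r"([A-Za-z]+)\.")

def extract_title(name):
    """Return the first word that ends with a period, or '' if there is none."""
    match = TITLE_PATTERN.search(name)
    return match.group(1) if match else ""

names = [
    "Andersson, Mr. Anders Johan",
    "Harrison, Mr. William",
    "Rice, Master. Eugene",
    "Ohman, Miss. Velin",
]
for name in names:
    print(name, "->", extract_title(name))
```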

**Display the "Title" column and count its values:**

In [32]:
combined.select(combined.Title).show()

+------+
| Title|
+------+
|    Mr|
|Master|
|    Mr|
|    Mr|
|  Miss|
|    Mr|
|    Mr|
|   Mrs|
|  Miss|
|   Mrs|
|    Mr|
|    Mr|
|    Mr|
|    Mr|
|    Mr|
|    Mr|
|  Miss|
|  Miss|
|  Miss|
|Master|
+------+
only showing top 20 rows



In [33]:
spark.sql("select count(Title) from combined").show()



+------------+
|count(Title)|
+------------+
|        1329|
+------------+





**We can see that Dr, Rev, Major, Col, Mlle, Capt, Don, Jonkheer, Countess, Ms, Sir, Lady, and Mme are rare titles, so create a dictionary that maps each of them to "rare".**

In [34]:
mapping={'Dr':'rare','Rev':'rare','Major':'rare','Col':'rare','Mlle':'rare','Capt':'rare',
        'Don':'rare','Jonkheer':'rare','Countess':'rare','Ms':'rare','Sir':'rare','Lady':'rare','Mme':'rare','Mr':'Mr'
        ,'Mrs':'Mrs','Miss':'Miss','Master':'Master'}
# Map any title not listed above to itself; distinct() avoids collecting every row.
for row in combined.select('Title').distinct().collect():
    mapping.setdefault(row[0], row[0])
mapping

                                                                                

{'Dr': 'rare',
 'Rev': 'rare',
 'Major': 'rare',
 'Col': 'rare',
 'Mlle': 'rare',
 'Capt': 'rare',
 'Don': 'rare',
 'Jonkheer': 'rare',
 'Countess': 'rare',
 'Ms': 'rare',
 'Sir': 'rare',
 'Lady': 'rare',
 'Mme': 'rare',
 'Mr': 'Mr',
 'Mrs': 'Mrs',
 'Miss': 'Miss',
 'Master': 'Master'}

**Run the function:**

In [35]:

def impute_title(title):
    return mapping[title]
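A direct lookup raises KeyError for any title that is missing from the dictionary. As a hedged alternative, the sketch below uses dict.get to fall back to the title itself. The function name impute_title_safe, the shortened mapping, and the unseen title "Dona" are illustrative assumptions, not part of the assignment.

```python
# Shortened copy of the mapping above, for illustration only.
mapping = {
    "Dr": "rare", "Rev": "rare", "Major": "rare",
    "Mr": "Mr", "Mrs": "Mrs", "Miss": "Miss", "Master": "Master",
}

def impute_title_safe(title):
    """Look up the title; return it unchanged if it is not in the mapping."""
    return mapping.get(title, title)

print(impute_title_safe("Dr"))    # a rare title from the mapping
print(impute_title_safe("Dona"))  # hypothetical unseen title, returned as-is
```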

**Apply the function on "Title" column using UDF:**

In [36]:
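from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
impute_title_udf = udf(impute_title, StringType())

# Assign the result back so the imputed titles are kept in "combined".
combined = combined.withColumn("Title", impute_title_udf(F.col("Title")))
combined.show(truncate=False)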


+-----------+--------+------+---------------------------------------------------------+------+----+-----+-----+-------------+--------+-----+--------+------+
|PassengerId|Survived|Pclass|Name                                                     |Sex   |Age |SibSp|Parch|Ticket       |Fare    |Cabin|Embarked|Title |
+-----------+--------+------+---------------------------------------------------------+------+----+-----+-----+-------------+--------+-----+--------+------+
|14         |0       |3     |Andersson, Mr. Anders Johan                              |male  |39.0|1    |5    |347082       |31.275  |null |S       |Mr    |
|166        |1       |3     |"Goldsmith, Master. Frank John William ""Frankie"""      |male  |9.0 |0    |2    |363291       |20.525  |null |S       |Master|
|264        |0       |1     |Harrison, Mr. William                                    |male  |40.0|0    |0    |112059       |0.0     |B94  |S       |Mr    |
|293        |0       |2     |Levy, Mr. Rene Jacques       

                                                                                

**Display "Title" from table and group by "Title" column:**

In [37]:
combined.select(combined.Title).show()

+------+
| Title|
+------+
|    Mr|
|Master|
|    Mr|
|    Mr|
|  Miss|
|    Mr|
|    Mr|
|   Mrs|
|  Miss|
|   Mrs|
|    Mr|
|    Mr|
|    Mr|
|    Mr|
|    Mr|
|    Mr|
|  Miss|
|  Miss|
|  Miss|
|Master|
+------+
only showing top 20 rows



## **Preprocessing Age**

**Based on the age mean, you will fill in the missing age values:**

In [38]:
from pyspark.sql import functions as F
# Use F.mean so the result variable "mean" does not shadow the imported function.
mean = combined.select(F.mean(combined['Age'])).collect()
mean[0][0]

                                                                                

30.079501879699244

**Fill missing age with age mean:**

In [39]:
combined = combined.na.fill(mean[0][0], subset='Age')
combined.select('Age').show()

+------------------+
|               Age|
+------------------+
|              39.0|
|               9.0|
|              40.0|
|              36.0|
|              26.0|
|30.079501879699244|
|30.079501879699244|
|              48.0|
|              22.0|
|              38.0|
|              45.0|
|              36.0|
|              60.0|
|30.079501879699244|
|              30.0|
|30.079501879699244|
|              22.0|
|30.079501879699244|
|              29.0|
|               2.0|
+------------------+
only showing top 20 rows
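Mean imputation computes the average over the known values only and then substitutes it wherever a value is missing. A minimal plain-Python sketch of the idea, using a short made-up list of ages rather than the real column:

```python
# Hypothetical ages; None marks a missing value.
ages = [39.0, 9.0, None, 36.0, None]

# The mean is taken over the non-missing values only.
known = [age for age in ages if age is not None]
age_mean = sum(known) / len(known)

# Replace each missing value with that mean.
filled = [age_mean if age is None else age for age in ages]
print(age_mean, filled)
```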



## **Preprocessing Embarked**

**Select "Embarked", count its values, order by count descending, and save the result in the embarked variable:**




In [40]:

embarked = combined.select('Embarked')\
                            .groupBy('Embarked')\
                            .count().orderBy('count',ascending=False)

**Show the grouped result:**

In [41]:
embarked.show()



+--------+-----+
|Embarked|count|
+--------+-----+
|       S|  962|
|       C|  253|
|       Q|  111|
|    null|    3|
+--------+-----+



                                                                                

**Collect the grouped result:**

In [42]:
embarked.collect()

                                                                                

[Row(Embarked='S', count=962),
 Row(Embarked='C', count=253),
 Row(Embarked='Q', count=111),
 Row(Embarked=None, count=3)]

**Fill missing values with 'S', the most frequent value in the grouped result:**

In [43]:
# Assign the result back so the filled values are kept in "combined".
combined = combined.na.fill('S', subset='Embarked')
combined.show()

+-----------+--------+------+--------------------+------+------------------+-----+-----+-------------+--------+-----+--------+------+
|PassengerId|Survived|Pclass|                Name|   Sex|               Age|SibSp|Parch|       Ticket|    Fare|Cabin|Embarked| Title|
+-----------+--------+------+--------------------+------+------------------+-----+-----+-------------+--------+-----+--------+------+
|         14|       0|     3|Andersson, Mr. An...|  male|              39.0|    1|    5|       347082|  31.275| null|       S|    Mr|
|        166|       1|     3|"Goldsmith, Maste...|  male|               9.0|    0|    2|       363291|  20.525| null|       S|Master|
|        264|       0|     1|Harrison, Mr. Wil...|  male|              40.0|    0|    0|       112059|     0.0|  B94|       S|    Mr|
|        293|       0|     2|Levy, Mr. Rene Ja...|  male|              36.0|    0|    0|SC/Paris 2163|  12.875|    D|       C|    Mr|
|        316|       1|     3|Nilsson, Miss. He...|female|     

In [44]:
combined.show()

+-----------+--------+------+--------------------+------+------------------+-----+-----+-------------+--------+-----+--------+------+
|PassengerId|Survived|Pclass|                Name|   Sex|               Age|SibSp|Parch|       Ticket|    Fare|Cabin|Embarked| Title|
+-----------+--------+------+--------------------+------+------------------+-----+-----+-------------+--------+-----+--------+------+
|         14|       0|     3|Andersson, Mr. An...|  male|              39.0|    1|    5|       347082|  31.275| null|       S|    Mr|
|        166|       1|     3|"Goldsmith, Maste...|  male|               9.0|    0|    2|       363291|  20.525| null|       S|Master|
|        264|       0|     1|Harrison, Mr. Wil...|  male|              40.0|    0|    0|       112059|     0.0|  B94|       S|    Mr|
|        293|       0|     2|Levy, Mr. Rene Ja...|  male|              36.0|    0|    0|SC/Paris 2163|  12.875|    D|       C|    Mr|
|        316|       1|     3|Nilsson, Miss. He...|female|     

## **Preprocessing Cabin**

**Replace the "Cabin" column with the first character of its string:**



In [45]:
from pyspark.sql.functions import col, substring
# Keep only the first character of each cabin value; nulls stay null.
combined = combined.withColumn('Cabin', substring(col('Cabin'), 1, 1))

**Show the result:**

In [46]:
combined.select(combined.Cabin).show()

+-----+
|Cabin|
+-----+
| null|
| null|
|    B|
|    D|
| null|
| null|
|    C|
| null|
| null|
| null|
| null|
|    E|
| null|
| null|
| null|
| null|
| null|
| null|
|    B|
| null|
+-----+
only showing top 20 rows



**Create the temporary view:**

In [47]:
combined.createOrReplaceTempView("com_tbl")

**Select the "Cabin" column, count its values, group by "Cabin", and order by count descending:**

Note that count(Cabin) counts only non-null values, so the null group below reports 0.

In [48]:
spark.sql('select Cabin,count(Cabin) as count from com_tbl group by Cabin order by count(Cabin) Desc').show()



+-----+-----+
|Cabin|count|
+-----+-----+
|    C|   82|
|    B|   77|
|    D|   52|
|    E|   51|
|    A|   23|
|    F|   18|
|    G|    4|
|    T|    1|
| null|    0|
+-----+-----+



                                                                                

**Fill missing values with "U":**

In [49]:
combined = combined.na.fill('U',subset='Cabin')
combined.select(combined.Cabin).show()

+-----+
|Cabin|
+-----+
|    U|
|    U|
|    B|
|    D|
|    U|
|    U|
|    C|
|    U|
|    U|
|    U|
|    U|
|    E|
|    U|
|    U|
|    U|
|    U|
|    U|
|    U|
|    B|
|    U|
+-----+
only showing top 20 rows



**StringIndexer: A label indexer that maps a string column of labels to an ML column of label indices. If the input column is numeric, we cast it to string and index the string values. The indices are in [0, numLabels). By default, this is ordered by label frequencies so the most frequent label gets index 0. The ordering behavior is controlled by setting stringOrderType. Its default value is ‘frequencyDesc’.**

**StringIndexer(inputCol=None, outputCol=None)**
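To make the default 'frequencyDesc' ordering concrete, here is a plain-Python sketch of the idea (not Spark's implementation). It uses the "Embarked" counts shown earlier; the helper name frequency_desc_index is hypothetical, and breaking ties alphabetically is an assumption of this sketch.

```python
from collections import Counter

def frequency_desc_index(labels):
    """Map each label to an index, most frequent label first (index 0.0)."""
    counts = Counter(labels)
    # Sort by descending count; ties are broken alphabetically (an assumption).
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

# Counts taken from the grouped "Embarked" output above (nulls left out).
labels = ["S"] * 962 + ["C"] * 253 + ["Q"] * 111
index = frequency_desc_index(labels)
print(index)
```

The indices are floats here because StringIndexer writes its output column as doubles.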

**Pipeline: ML Pipelines provide a uniform set of high-level APIs built on top of DataFrames that help users create and tune practical machine learning pipelines.**

____________________________________________

**Use Pipeline to fit and transform:**

In [50]:
from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler


**VectorAssembler: VectorAssembler(*, inputCols=None, outputCol=None) A feature transformer that merges multiple columns into a vector column.**



**Use the randomSplit function to split the data into train and test sets with 80% and 20% of the rows, respectively:**

In [51]:
train,test=combined.randomSplit([.8,.2],seed=0)
train.show()

+-----------+--------+------+--------------------+------+------------------+-----+-----+-------------+--------+-----+--------+------+
|PassengerId|Survived|Pclass|                Name|   Sex|               Age|SibSp|Parch|       Ticket|    Fare|Cabin|Embarked| Title|
+-----------+--------+------+--------------------+------+------------------+-----+-----+-------------+--------+-----+--------+------+
|         14|       0|     3|Andersson, Mr. An...|  male|              39.0|    1|    5|       347082|  31.275|    U|       S|    Mr|
|        166|       1|     3|"Goldsmith, Maste...|  male|               9.0|    0|    2|       363291|  20.525|    U|       S|Master|
|        264|       0|     1|Harrison, Mr. Wil...|  male|              40.0|    0|    0|       112059|     0.0|    B|       S|    Mr|
|        293|       0|     2|Levy, Mr. Rene Ja...|  male|              36.0|    0|    0|SC/Paris 2163|  12.875|    D|       C|    Mr|
|        316|       1|     3|Nilsson, Miss. He...|female|     

In [52]:
categoricalCols = [field for (field, dataType) in train.dtypes
                   if dataType == "string"]
categoricalCols.remove('Name')
# categoricalCols.remove('Title')
categoricalCols

['Sex', 'Ticket', 'Cabin', 'Embarked', 'Title']

In [53]:
indexOutputCols = [x + "_Index" for x in categoricalCols]
indexOutputCols

['Sex_Index', 'Ticket_Index', 'Cabin_Index', 'Embarked_Index', 'Title_Index']

In [54]:
oheOutputCols = [x + "_OHE" for x in categoricalCols]
oheOutputCols

['Sex_OHE', 'Ticket_OHE', 'Cabin_OHE', 'Embarked_OHE', 'Title_OHE']

In [55]:
stringIndexer = StringIndexer(inputCols=categoricalCols,
                             outputCols=indexOutputCols,
                             handleInvalid='skip')
oheEncoder = OneHotEncoder(inputCols=indexOutputCols,
                          outputCols=oheOutputCols)

In [56]:
# Keep numeric columns, excluding the label ('Survived') and the row id.
numericCols = [field for (field, dataType) in train.dtypes
               if dataType in ('double', 'int')
               and field not in ('Survived', 'PassengerId')]
numericCols

['Pclass', 'Age', 'SibSp', 'Parch', 'Fare']

In [57]:
vecAssembler = VectorAssembler(inputCols=oheOutputCols+numericCols,outputCol='features')

**Build RandomForestClassifier model and use pipeline to fit and transform then display "prediction, Survived, features" columns**

In [58]:
from pyspark.ml.classification import RandomForestClassifier
random_forest= RandomForestClassifier(featuresCol='features', labelCol='Survived', predictionCol='prediction', maxDepth=7)


In [59]:
pipeline =Pipeline(stages = [stringIndexer,oheEncoder,vecAssembler,random_forest])

In [60]:
Model=pipeline.fit(train)

                                                                                

In [61]:
predDF = Model.transform(test)

**Use MulticlassClassificationEvaluator and set the "labelCol" to "Survived",  "predictionCol" to "prediction", "metricName" to "accuracy"** 

In [62]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
MClassC = MulticlassClassificationEvaluator(predictionCol='prediction',
                                        labelCol='Survived',
                                        metricName='accuracy')

In [63]:
accuracy = MClassC.evaluate(predDF)
accuracy

                                                                                

0.838150289017341
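The "accuracy" metric is the fraction of rows whose prediction equals the label. A minimal plain-Python sketch of that definition follows; the function name and the short label and prediction lists are made up for illustration and are not the model's output.

```python
def accuracy(labels, predictions):
    """Return the fraction of predictions that match their labels."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)

# Hypothetical values: four of the five predictions are correct.
labels = [0, 1, 1, 0, 1]
predictions = [0, 1, 0, 0, 1]
print(accuracy(labels, predictions))
```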

**When you are finished, send the project via Google Classroom.**

**Please let me know if you have any questions.**
* nabieh.mostafa@yahoo.com
* +201015197566 (Whatsapp)

**Don't hate me; I push you to learn.**

**I will help you to become an awesome data engineer.**

**Why did I say "Data Engineer"?**

**It's a tricky but optional question. If you would like to know the answer, ask me.**
