
# PySpark - Tutorial - Part 6 - MLlib - Linear Regression - Intro
Notebook by Sam Faraday
June 2022



## Sources:

Free Code Camp: PySpark Tutorial at https://www.youtube.com/watch?v=_C8kWso4ne4

Apache Spark API Reference at https://spark.apache.org/docs/latest/api/python/reference/index.html

# 1. Installing PySpark

In [1]:
pip install pyspark # run this each time you connect to a new Google Colab runtime

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


# 2. Importing the required libraries

In [2]:
import pandas as pd

# 3. Creating the Test6 Dataset

In [3]:
data = {'Index':[1,2,3,4,5,6], 'Name':['Tom', 'Nick', 'Krish', 'Paul','Jack','Sam'], 'Age':[31, 30, 29, 24, 21,23],'Experience':[10, 8, 4, 3,1,2], 'Salary':[30000, 25000, 20000, 20000, 15000, 18000] }
# Create DataFrame
df = pd.DataFrame(data)
df

   Index   Name  Age  Experience  Salary
0      1    Tom   31          10   30000
1      2   Nick   30           8   25000
2      3  Krish   29           4   20000
3      4   Paul   24           3   20000
4      5   Jack   21           1   15000
5      6    Sam   23           2   18000


# 4. Saving the Dataset to CSV

In [4]:
df.to_csv('test6.csv', index=False)

# 5. Initializing PySpark


In [5]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("MLIB").getOrCreate()

spark

# 6. Reading the Dataset

In [6]:
df_training = spark.read.csv("test6.csv", header=True, inferSchema=True)
df_training.show()

+-----+-----+---+----------+------+
|Index| Name|Age|Experience|Salary|
+-----+-----+---+----------+------+
|    1|  Tom| 31|        10| 30000|
|    2| Nick| 30|         8| 25000|
|    3|Krish| 29|         4| 20000|
|    4| Paul| 24|         3| 20000|
|    5| Jack| 21|         1| 15000|
|    6|  Sam| 23|         2| 18000|
+-----+-----+---+----------+------+



# 7. Checking the Schema

In [7]:
df_training.printSchema()

root
 |-- Index: integer (nullable = true)
 |-- Name: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Experience: integer (nullable = true)
 |-- Salary: integer (nullable = true)



In [8]:
df_training.summary().show()

+-------+------------------+----+------------------+-----------------+------------------+
|summary|             Index|Name|               Age|       Experience|            Salary|
+-------+------------------+----+------------------+-----------------+------------------+
|  count|                 6|   6|                 6|                6|                 6|
|   mean|               3.5|null|26.333333333333332|4.666666666666667|21333.333333333332|
| stddev|1.8708286933869707|null| 4.179314138308661|3.559026084010437| 5354.126134736337|
|    min|                 1|Jack|                21|                1|             15000|
|    25%|                 2|null|                23|                2|             18000|
|    50%|                 3|null|                24|                3|             20000|
|    75%|                 5|null|                30|                8|             25000|
|    max|                 6| Tom|                31|               10|             30000|
+-------+------------------+----+------------------+-----------------+------------------+
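
The `mean` and `stddev` rows can be cross-checked in plain Python. The sketch below assumes that `stddev` is the sample standard deviation (dividing by n - 1), which is what the Salary figures above are consistent with. The variable names are illustrative only.

```python
import statistics

# Salary column from the dataset created in section 3.
salaries = [30000, 25000, 20000, 20000, 15000, 18000]

# Arithmetic mean of the six salaries.
salary_mean = statistics.mean(salaries)

# statistics.stdev computes the sample standard deviation (divides by n - 1).
salary_stddev = statistics.stdev(salaries)

print("mean:", salary_mean)
print("stddev:", salary_stddev)
```

Swapping in `statistics.pstdev` (population standard deviation) gives a smaller value that does not match the table, which supports the sample-based reading.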

# 8. Creating the Independent Features Group

Salary is our dependent feature (the label), the one we want to predict.

Age and Experience are our independent features. Let us group them into a single vector column.

In [9]:
from pyspark.ml.feature import VectorAssembler

In [10]:
my_feature_assembler = VectorAssembler(inputCols=['Age', 'Experience'], outputCol="Independent Features")

In [11]:
output = my_feature_assembler.transform(df_training)

In [12]:
output.show()

+-----+-----+---+----------+------+--------------------+
|Index| Name|Age|Experience|Salary|Independent Features|
+-----+-----+---+----------+------+--------------------+
|    1|  Tom| 31|        10| 30000|         [31.0,10.0]|
|    2| Nick| 30|         8| 25000|          [30.0,8.0]|
|    3|Krish| 29|         4| 20000|          [29.0,4.0]|
|    4| Paul| 24|         3| 20000|          [24.0,3.0]|
|    5| Jack| 21|         1| 15000|          [21.0,1.0]|
|    6|  Sam| 23|         2| 18000|          [23.0,2.0]|
+-----+-----+---+----------+------+--------------------+



In [13]:
output.columns

['Index', 'Name', 'Age', 'Experience', 'Salary', 'Independent Features']
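
Conceptually, the `Independent Features` column holds one vector of floating-point values per row, built from the input columns in the order given. The following plain-Python sketch mimics that result for illustration. It is not how Spark implements `VectorAssembler`, and the `assemble` helper is hypothetical.

```python
# Rows of the dataset as plain dictionaries (same values as the table above).
rows = [
    {"Name": "Tom", "Age": 31, "Experience": 10, "Salary": 30000},
    {"Name": "Nick", "Age": 30, "Experience": 8, "Salary": 25000},
    {"Name": "Krish", "Age": 29, "Experience": 4, "Salary": 20000},
    {"Name": "Paul", "Age": 24, "Experience": 3, "Salary": 20000},
    {"Name": "Jack", "Age": 21, "Experience": 1, "Salary": 15000},
    {"Name": "Sam", "Age": 23, "Experience": 2, "Salary": 18000},
]

input_cols = ["Age", "Experience"]


def assemble(row, cols):
    """Hypothetical helper: collect the chosen columns into one list of floats."""
    return [float(row[col]) for col in cols]


vectors = [assemble(row, input_cols) for row in rows]

for row, vector in zip(rows, vectors):
    print(row["Name"], vector, row["Salary"])
```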

# 9. Selecting only the Dependent Feature (Salary) and the Grouped Features

In [14]:
finalized_data =  output.select("Independent Features", "Salary")
finalized_data.show()

+--------------------+------+
|Independent Features|Salary|
+--------------------+------+
|         [31.0,10.0]| 30000|
|          [30.0,8.0]| 25000|
|          [29.0,4.0]| 20000|
|          [24.0,3.0]| 20000|
|          [21.0,1.0]| 15000|
|          [23.0,2.0]| 18000|
+--------------------+------+



# 10. Importing the Regression Library

In [15]:
from pyspark.ml.regression import LinearRegression

# 11. Splitting the Dataset into Training (75%) and Testing (25%)

In [16]:
training_data, testing_data = finalized_data.randomSplit([0.75,0.25])

In [17]:
training_data.show()

+--------------------+------+
|Independent Features|Salary|
+--------------------+------+
|          [21.0,1.0]| 15000|
|          [23.0,2.0]| 18000|
|          [24.0,3.0]| 20000|
|          [29.0,4.0]| 20000|
|         [31.0,10.0]| 30000|
+--------------------+------+



In [18]:
testing_data.show()

+--------------------+------+
|Independent Features|Salary|
+--------------------+------+
|          [30.0,8.0]| 25000|
+--------------------+------+



# 12. Fitting the model

In [19]:
regressor = LinearRegression(featuresCol='Independent Features', labelCol='Salary')

In [20]:
regressor = regressor.fit(training_data)


# 13. Checking the Coefficients and Intercept

In [21]:
regressor.coefficients

DenseVector([-102.53, 1688.6818])

In [22]:
regressor.intercept

16470.039946737463
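
These values can be cross-checked against the closed-form ordinary least squares solution for two features. The sketch below assumes the default `LinearRegression` settings used in this notebook apply no regularization, and it uses the five training rows shown in the section 11 output. Variable names are illustrative.

```python
# Training rows from the section 11 output: (Age, Experience, Salary).
train = [
    (21, 1, 15000),
    (23, 2, 18000),
    (24, 3, 20000),
    (29, 4, 20000),
    (31, 10, 30000),
]

n = len(train)
mean_age = sum(age for age, _, _ in train) / n
mean_exp = sum(exp for _, exp, _ in train) / n
mean_sal = sum(sal for _, _, sal in train) / n

# Centered sums of squares and cross-products.
s_aa = sum((age - mean_age) ** 2 for age, _, _ in train)
s_ee = sum((exp - mean_exp) ** 2 for _, exp, _ in train)
s_ae = sum((age - mean_age) * (exp - mean_exp) for age, exp, _ in train)
s_ay = sum((age - mean_age) * (sal - mean_sal) for age, _, sal in train)
s_ey = sum((exp - mean_exp) * (sal - mean_sal) for _, exp, sal in train)

# Solve the 2x2 normal equations with Cramer's rule.
det = s_aa * s_ee - s_ae ** 2
coef_age = (s_ay * s_ee - s_ae * s_ey) / det
coef_exp = (s_aa * s_ey - s_ae * s_ay) / det
intercept = mean_sal - coef_age * mean_age - coef_exp * mean_exp

print("Coefficients:", [coef_age, coef_exp])
print("Intercept:", intercept)
```

The negative Age coefficient is a reminder that Age and Experience are strongly correlated in this tiny dataset, so the individual coefficients should not be over-interpreted.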

# 14. Predicting

In [23]:
my_prediction =  regressor.evaluate(testing_data)

In [24]:
my_prediction.predictions.show()



+--------------------+------+------------------+
|Independent Features|Salary|        prediction|
+--------------------+------+------------------+
|          [30.0,8.0]| 25000|26903.595206391477|
+--------------------+------+------------------+



In [25]:
# Print the coefficients and intercept for linear regression
print("Coefficients: %s" % str(regressor.coefficients))
print("Intercept: %s" % str(regressor.intercept))

Coefficients: [-102.52996005325171,1688.6817576564458]
Intercept: 16470.039946737463
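
The prediction for the test row `[30.0, 8.0]` can be reproduced by hand from these numbers, since a linear regression prediction is the intercept plus the weighted sum of the features. A minimal sketch using the printed values:

```python
# Values printed by the notebook above.
coefficients = [-102.52996005325171, 1688.6817576564458]
intercept = 16470.039946737463

# The single row in the testing set: Age = 30, Experience = 8.
features = [30.0, 8.0]

# prediction = intercept + coef_age * Age + coef_experience * Experience
prediction = intercept + sum(c * x for c, x in zip(coefficients, features))

print("Prediction:", prediction)
```

The result agrees with the `prediction` column shown in section 14 up to floating-point rounding.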


In [26]:
my_prediction.meanAbsoluteError

1903.5952063914774

In [27]:
my_prediction.meanSquaredError

3623674.7097966117
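
Both metrics follow directly from the prediction table. The sketch below recomputes them in plain Python from the single test row. With only one row, the mean squared error is simply the square of the mean absolute error, so these figures say little about how the model generalizes.

```python
# Actual and predicted salary for the single testing row.
labels = [25000.0]
predictions = [26903.595206391477]

errors = [pred - label for pred, label in zip(predictions, labels)]

# Mean absolute error: average of |prediction - label|.
mae = sum(abs(err) for err in errors) / len(errors)

# Mean squared error: average of (prediction - label) squared.
mse = sum(err ** 2 for err in errors) / len(errors)

print("MAE:", mae)
print("MSE:", mse)
```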