# Step 1. Loading the data into dataframes
In this step, the code creates a Spark session and loads the data into three dataframes: posts, postType, and users. The data is read from parquet and CSV files.


1.1: Imports the necessary libraries and functions

In [0]:
# Import necessary libraries and functions
from pyspark.sql import SparkSession
from pyspark.sql.functions import split, translate, trim, explode, regexp_replace, col, lower
from pyspark.ml.feature import Tokenizer
from pyspark.ml.feature import StopWordsRemover
from pyspark.ml.feature import CountVectorizer
from pyspark.ml.feature import HashingTF, IDF
from pyspark.ml.feature import StringIndexer
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

import plotly.graph_objects as go
import pandas as pd
import plotly.graph_objects as go
import plotly.colors

1.2 Initialize a Spark session.  

In [0]:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Data Loading") \
    .getOrCreate()

1.3 Load the data into dataframes into three dataframes: posts, postType, and users. 

In [0]:
# Load posts data from a parquet file
posts = spark.read.parquet("/mnt/deBDProject/ml_training/Posts/*")

# Display the posts DataFrame
#display(posts)


In [0]:
# Load postType data from a CSV file
postType = spark.read.csv("/mnt/deBDProject/ml_training/PostTypes.txt", header=True, inferSchema=True)

# Display the postType DataFrame
# display(postType)

In [0]:
# Load users data from a CSV file
users = spark.read.format("csv").option("header", "true").load("/mnt/deBDProject/ml_training/users.csv")

# Display the users DataFrame
# display(users)


- (Optional) Check the loaded data.


In [0]:
# Print the schema of each dataframe
posts.printSchema()
postType.printSchema()
users.printSchema()

# Display the first few rows of each dataframe
posts.show()
postType.show()
users.show()

root
 |-- id: integer (nullable = true)
 |-- AcceptedAnswerId: integer (nullable = true)
 |-- AnswerCount: integer (nullable = true)
 |-- Body: string (nullable = true)
 |-- CommentCount: integer (nullable = true)
 |-- CreationDate: timestamp (nullable = true)
 |-- FavoriteCount: integer (nullable = true)
 |-- LastEditDate: timestamp (nullable = true)
 |-- LastEditorDisplayName: string (nullable = true)
 |-- LastEditorUserId: integer (nullable = true)
 |-- OwnerUserId: integer (nullable = true)
 |-- ParentId: integer (nullable = true)
 |-- PostTypeId: integer (nullable = true)
 |-- Score: float (nullable = true)
 |-- Tags: string (nullable = true)
 |-- Title: string (nullable = true)
 |-- ViewCount: integer (nullable = true)

root
 |-- Id: integer (nullable = true)
 |-- Type: string (nullable = true)

root
 |-- id: string (nullable = true)
 |-- Age: string (nullable = true)
 |-- CreationDate: string (nullable = true)
 |-- DisplayName: string (nullable = true)
 |-- DownVotes: string (nu

1.3 Saving the dataframes for easy retrieval.


In [0]:
# Save the 3 tables to databricks local file system
posts.write.mode('overwrite').parquet("/tmp/project/posts.parquet")
postType.write.mode('overwrite').parquet("/tmp/project/PostType.parquet")
users.write.mode('overwrite').parquet("/tmp/project/user.parquet")

In [0]:
# Creating Spark Session
spark = (SparkSession
         .builder
         .appName("ML Model")
         .getOrCreate())

sc = spark.sparkContext

In [0]:
# Read in the tables
posts = spark.read.parquet("/tmp/project/posts.parquet")
postType = spark.read.parquet("/tmp/project/PostType.parquet")
Users = spark.read.parquet("/tmp/project/user.parquet")

# Step 2. Join tables and Explore the Data
- Join the tables
- Visualize the data using charts and plots
- Filter the data
- Format the body and tag columns for ML
- Select columns

2.1 Join the tables

Based on the table schemas you provided, it appears that we can join the "post" and "postTypes" tables based on the "PostTypeId" column. Additionally, we can join the "post" and "users" tables based on the "OwnerUserId" column.

In [0]:
# Join "post" and "postTypes" tables based on "PostTypeId"
joined_df = posts.join(postType, posts.PostTypeId == postType.Id, "inner")
#joined_df = posts.join(postType, posts.PostTypeId == postType.id, "inner")

# Join "post" and "users" tables based on "OwnerUserId"
joined_df = joined_df.join(users, joined_df.OwnerUserId == users.id, "inner")

display(joined_df)


id,AcceptedAnswerId,AnswerCount,Body,CommentCount,CreationDate,FavoriteCount,LastEditDate,LastEditorDisplayName,LastEditorUserId,OwnerUserId,ParentId,PostTypeId,Score,Tags,Title,ViewCount,Id,Type,id.1,Age,CreationDate.1,DisplayName,DownVotes,EmailHash,Location,Reputation,UpVotes,Views,WebsiteUrl,AccountId
7015446,0,0,"$seconds = time() - strtotime('2011-01-01 00:00:00'); $minutes = $seconds / 60; To elaborate a bit more: This is some simple manipulation of a unix timestamp (number of seconds since Jan 1, 1970). So you take the current timestamp and subtract what the timestamp would have been on the first of the month. This gives you total seconds that have elapsed this month. If you divide by 60, you get total minutes that have elapsed this month.",3,2023-08-10T00:00:00Z,0,2023-08-10T00:00:00Z,,790335,402253,7015410,2,2.0,,,0.0,2.0,Answer,402253.0,,2018-07-26,Mchl,114.0,,"Warsaw, Poland",49808.0,2450.0,2243.0,http://blog.michaljarosz.biz,173515.0
7003283,0,0,"Afer the panel is rendered (afterrender event) check if store is is loaded yet (usally it will not be, unless panel's render has been deferred because it's in an inactive tab for example). If the test fails in the same event add load listener to the store, that will push data back to panel once it's ready.",2,2023-08-09T00:00:00Z,0,2023-08-10T00:00:00Z,,402253,402253,6997996,2,1.0,,,0.0,2.0,Answer,402253.0,,2018-07-26,Mchl,114.0,,"Warsaw, Poland",49808.0,2450.0,2243.0,http://blog.michaljarosz.biz,173515.0
7017038,0,0,mongod is the primary MongoDB database process that runs on an individual server,0,2023-08-10T00:00:00Z,0,2023-08-10T00:00:00Z,,402253,402253,0,4,0.0,,,0.0,4.0,TagWikiExerpt,402253.0,,2018-07-26,Mchl,114.0,,"Warsaw, Poland",49808.0,2450.0,2243.0,http://blog.michaljarosz.biz,173515.0
7018275,0,0,"This is most likely because $stmt = $mysqli->prepare(...); line fails due to SQL syntax error. Try echoing $mysqli->error to see what's wrong with it. Try calling $stmt->store_result(); after execution of your SELECT statement and before issuing any other queries to MySQL. Side note: you should prepare your statement before foreach loop. That will get you a bit of performance gain, since the statement will only be compiled once and only parameters will be sent to server on each loop run.",5,2023-08-10T00:00:00Z,0,2023-08-10T00:00:00Z,,402253,402253,7018137,2,3.0,,,0.0,2.0,Answer,402253.0,,2018-07-26,Mchl,114.0,,"Warsaw, Poland",49808.0,2450.0,2243.0,http://blog.michaljarosz.biz,173515.0
7013733,0,1,"""If you see the following code Table tblTest = (Table)tblControl; StringBuilder text = new StringBuilder(); StringWriter writer = new StringWriter(text); HtmlTextWriter htmlWriter = new HtmlTextWriter(writer); tblTest.RenderControl(htmlWriter); htmlCode = text.ToString(); here i am converting a table object to string. I'll get the output as """"<table><tr><td>item</td></tr></table>"""" Now i want to Rollback it. I am having a string and i need to convert that into WebControls.Table object. Please someone suggest some way. """,0,2023-08-10T00:00:00Z,2,2023-08-10T00:00:00Z,,76337,885771,0,1,2.0,,Convert string to WebControls - asp.net,1393.0,1.0,Question,885771.0,,2019-08-09,michael,0.0,,,16.0,1.0,3.0,,475439.0
7014321,0,0,"assuming the code is in a file named login.java... compile with: javac login.java should produce login.class, run with: java login",0,2023-08-10T00:00:00Z,0,2023-08-10T00:00:00Z,,683825,105536,7014276,2,1.0,,,0.0,2.0,Answer,105536.0,,2017-05-12,Tim Hoolihan,13.0,,"Copley, OH, United States",10489.0,426.0,829.0,http://timhoolihan.com,36951.0
5018324,0,0,"Subqueries execute every time you evaluate them (in MySQL anyway, not all RDBMSes), i.e. you're basically running 7 million queries! Using a JOIN, if possible, will reduce this to 1. Even if adding indexing improves performance of those, you're still running them.",7,2023-02-16T00:00:00Z,0,2023-08-10T00:00:00Z,,277084,277084,5018284,2,22.0,,,0.0,2.0,Answer,277084.0,,2018-02-19,Brian,5.0,,United Kingdom,5681.0,1391.0,333.0,http://n/a,103447.0
6873954,0,0,Abstract away your table item adding necessary fields to the interface or base class: interface ITableItem // or just a simple or abstract class { // common fields go here } Then can you make your item group generic with a constraint on generic parameter. public class ItemGroup<T> where T: ITableItem { public string SectionName { get; set; } public List<T> Items { get; private set; } public ItemGroup() { Items = new List<T>(); } },0,2023-07-29T00:00:00Z,0,2023-08-10T00:00:00Z,,283975,283975,6873869,2,6.0,,,0.0,2.0,Answer,283975.0,,2018-03-01,Grozz,60.0,,"San Francisco, CA",5890.0,526.0,446.0,http://groz.github.io/,106630.0
7005118,7005591,1,"I have a machine running gitolite that is used both for code repos and for Sparkleshare. The problem is that Sparkleshare creates it's own key pair; that key pair authenticates first, and has no permissions on the code repos, so gitolite terminates without trying any other pairs. I'm thinking that I may need to figure out how to either tell Sparkleshare to use my original key, or write an alias that forces gitolite to use the correct private key--something I'm not sure is even possible.",4,2023-08-10T00:00:00Z,0,2023-08-10T00:00:00Z,,316184,316184,0,1,2.0,,Is there an easy way to use more than one private ssh key on the same gitolite client?,922.0,1.0,Question,316184.0,,2018-03-26,Bryan Agee,9.0,,"Spokane, WA",3275.0,422.0,266.0,,122395.0
6971875,7006964,1,"""I am using the following line to get a column sum: columns.Bound(item => item.McGross).Width(50).Title(""""Amount"""").Aggregate(aggreages => aggreages.Sum()).Format(""""{0:c}"""").FooterTemplate(result => { %><%= result.Sum.Format(""""{0:c}"""") %><% }); I get error when any of the column valuse are null. How can I use """"if"""" null put """"0"""" for that record. Thanks in advance. """,0,2023-08-07T00:00:00Z,1,2023-08-10T00:00:00Z,,331174,373721,0,1,2.0,,Telerik MVC Grid - Sum error,1392.0,1.0,Question,373721.0,,2018-06-22,hncl,0.0,,,1165.0,19.0,424.0,,156435.0


2.2 Visualize the data using charts and plots


In [0]:
# Best Articles by Views
print("Top 10 Articles by Views")
best_articles = joined_df.orderBy('ViewCount', ascending=False).limit(10)
#best_articles.show()

best_articles_pandas = best_articles.select('Title', 'ViewCount').toPandas()
best_articles_pandas['ViewCount'] = best_articles_pandas['ViewCount'].apply(lambda x: f'{x:,.0f}')
#print(best_articles_pandas)

fig = go.Figure(data=[go.Bar(x=best_articles_pandas['Title'], y=best_articles_pandas['ViewCount'], marker_color='Teal')])
fig.update_layout(xaxis_title='Article Title', yaxis_title='View Count', title='Top 10 Articles by Views')
fig.show()


Top 10 Articles by Views


In [0]:
import plotly.graph_objects as go
import plotly.colors

# Filter out null values in the 'Type' column
filtered_df = joined_df.filter(joined_df['Type'].isNotNull())

# Get the count of each post type
post_types = filtered_df.filter(filtered_df['Type'] != 'NULL') \
                       .groupBy('Type').count() \
                       .orderBy('count', ascending=False).collect()

# Extract the post types and counts
type_names = [row['Type'] for row in post_types]
type_counts = [row['count'] for row in post_types]

# Define the color palette
teal_palette = plotly.colors.sequential.Teal

# Create a pie chart with customized colors
fig = go.Figure(data=go.Pie(labels=type_names, values=type_counts, marker=dict(colors=teal_palette)))
fig.update_layout(title='Post Types')

fig.show()

In [0]:
# Filter out null values in the 'Location' column
filtered_df = joined_df.filter(joined_df['Location'].isNotNull())

# Get the top countries excluding null values
top_countries = filtered_df.filter(filtered_df['Location'] != 'NULL') \
                          .groupBy('Location').count() \
                          .orderBy('count', ascending=False).limit(10)

# Extract the country names and counts
country_names = [row['Location'] for row in top_countries.collect()]
country_counts = [row['count'] for row in top_countries.collect()]

# Create an interactive map
fig = go.Figure(data=go.Scattergeo(
    locations=country_names,
    locationmode='country names',
    text=country_names,  # Added for displaying location names
    marker=dict(
        size=country_counts,
        sizemode='area',
        sizeref=max(country_counts) / 100,
        color=country_counts,
        colorscale='Teal',
        colorbar=dict(title='Count')
    )
))

fig.update_layout(
    title='Top Countries by Location',
    geo=dict(
        showframe=False,
        showcoastlines=True,
        projection_type='natural earth',
        landcolor='lightgray',  # Added to change the land color
        showland=True  # Added to show land area
    ),
    height=600,  # Adjust the height of the map
    margin=dict(l=0, r=0, t=40, b=0)  # Adjust the margins
)

fig.show()

2.3 Format the body and tag columns for ML


In [0]:
# Formatting and clean the 'Body' and `Tag` columns for machine learning training
df = (joined_df.withColumn('Body', regexp_replace(joined_df.Body, r'<.*?>', '')) # Transforming HTML code to strings
      .withColumn("Tags", split(trim(translate(col("Tags"), "<>", " ")), " ")) # Making a list of the tags
)
# display(df)

2.4 Filter the data

In [0]:
# Filter by Questions
df = df.filter(col("Type")  == "Question")


2.5 Select columns

In [0]:
df = df.select(col("Body").alias("text"), col("Tags"))
# Producing the tags as individual tags instead of an array
# This is duplicating the posts for each possible tag
df = df.select("text", explode("Tags").alias("tags"))
# display(df)

2.6 Saveing the file in tmp folder

In [0]:
# saving the file as a checkpoint (in case the cluster gets terminated)
df.write.mode('overwrite').parquet("/tmp/project.df.parquet")

# Saving the dataframe to memory for repetitive use
df.cache()
df.count()


2658

# Step 3. Preprocessing the data to prepare it for ML model

3.1. Text Cleaning Preprocessing:

The text in the Body column is preprocessed by removing URLs, special characters, multiple spaces, converting to lowercase, and trimming whitespaces.



In [0]:
# Preprocessing the data 
cleaned = df.withColumn('text',regexp_replace('text', r"http\S+", "")) \
                    .withColumn('text',regexp_replace('text', r"[^a-zA-z]", " ")) \
                    .withColumn('text', regexp_replace('text', r"\s+", " ")) \
                    .withColumn('text', lower('text')) \
                    .withColumn('text', trim('text')) 
display(cleaned)

text,tags
i have an application that collects actions and sends them off to a remote server as these actions aren t time critical think of them as log lines i want to queue them up and send them in batches that way i also want to ensure that no message is ever lost unless the hard drive crashes msmq seems rather heavyweight arcane and weird to use also it needs to be installed as a system component serializing my messages into json and storing them in sqlite is trivial and straight forward but before i do that i wonder if there is a standardized preferably amqp compatible queue that i doesn t require installation and can be embedded into an app,c#
i have an application that collects actions and sends them off to a remote server as these actions aren t time critical think of them as log lines i want to queue them up and send them in batches that way i also want to ensure that no message is ever lost unless the hard drive crashes msmq seems rather heavyweight arcane and weird to use also it needs to be installed as a system component serializing my messages into json and storing them in sqlite is trivial and straight forward but before i do that i wonder if there is a standardized preferably amqp compatible queue that i doesn t require installation and can be embedded into an app,.net
i have an application that collects actions and sends them off to a remote server as these actions aren t time critical think of them as log lines i want to queue them up and send them in batches that way i also want to ensure that no message is ever lost unless the hard drive crashes msmq seems rather heavyweight arcane and weird to use also it needs to be installed as a system component serializing my messages into json and storing them in sqlite is trivial and straight forward but before i do that i wonder if there is a standardized preferably amqp compatible queue that i doesn t require installation and can be embedded into an app,message-queue
i have an application that collects actions and sends them off to a remote server as these actions aren t time critical think of them as log lines i want to queue them up and send them in batches that way i also want to ensure that no message is ever lost unless the hard drive crashes msmq seems rather heavyweight arcane and weird to use also it needs to be installed as a system component serializing my messages into json and storing them in sqlite is trivial and straight forward but before i do that i wonder if there is a standardized preferably amqp compatible queue that i doesn t require installation and can be embedded into an app,rabbitmq
i have an application that collects actions and sends them off to a remote server as these actions aren t time critical think of them as log lines i want to queue them up and send them in batches that way i also want to ensure that no message is ever lost unless the hard drive crashes msmq seems rather heavyweight arcane and weird to use also it needs to be installed as a system component serializing my messages into json and storing them in sqlite is trivial and straight forward but before i do that i wonder if there is a standardized preferably amqp compatible queue that i doesn t require installation and can be embedded into an app,amqp
i m converting an iphone app to the ipad and i m getting stuck on an innocent and seemingly trivial positioning problem i have a uitableviewcell that only contains a uiswitch centered in the cell the cell is in a uitableview with the grouped style on the iphone i merely set the center property of the switch to the center of the cell in tableview cellforrowatindexpath that suffices for the iphone because the center of the cell is the same regardless of table style this happens on the iphone too but the offset is smaller so the difference is more subtle here s the code for the iphone cell [[[uitableviewcell alloc] initwithstyle uitableviewcellstyledefault reuseidentifier cellidentifier] autorelease] uiswitch view [[uiswitch alloc] initwithframe cgrectzero] [view addtarget self action selector switchchanged forcontrolevents uicontroleventvaluechanged] view center cell center cell selectionstyle uitableviewcellselectionstylenone [cell contentview addsubview view] [view release] on the ipad without changing any code the position of the switch is still in the center of an iphone screen that s bad since ipad is way bigger than iphone so i moved the positioning code to tableview willdisplaycell forrowatindexpath and now it s on the other side of center closer to the right side of the screen void tableview uitableview tableview willdisplaycell uitableviewcell cell forrowatindexpath nsindexpath indexpath uiswitch s uiswitch [cell contentview viewwithtag switchtag] s center cgpointmake cell contentview center x s center y pushes the switch too far to the right and yet when i log the various coordinates and frames that i m interested in i get what i expect the x coordinate of the center of the cell is at which is half the screen width i ve also tried recalculating the whole frame of the uiswitch but since i m using the values of the screen the switch ends up in the same spot too far to the right what i think is going on is that there is some transform after tableview willdisplaycell forrowatindexpath is called or before even to render all those numbers meaningless it almost looks as if the whole cell gets shifted by x where x is the distance from the edge of the screen to the start of the cell even though that s not reported in the frame property i m sure there s something trivial i m missing please enlighten me,iphone
i m converting an iphone app to the ipad and i m getting stuck on an innocent and seemingly trivial positioning problem i have a uitableviewcell that only contains a uiswitch centered in the cell the cell is in a uitableview with the grouped style on the iphone i merely set the center property of the switch to the center of the cell in tableview cellforrowatindexpath that suffices for the iphone because the center of the cell is the same regardless of table style this happens on the iphone too but the offset is smaller so the difference is more subtle here s the code for the iphone cell [[[uitableviewcell alloc] initwithstyle uitableviewcellstyledefault reuseidentifier cellidentifier] autorelease] uiswitch view [[uiswitch alloc] initwithframe cgrectzero] [view addtarget self action selector switchchanged forcontrolevents uicontroleventvaluechanged] view center cell center cell selectionstyle uitableviewcellselectionstylenone [cell contentview addsubview view] [view release] on the ipad without changing any code the position of the switch is still in the center of an iphone screen that s bad since ipad is way bigger than iphone so i moved the positioning code to tableview willdisplaycell forrowatindexpath and now it s on the other side of center closer to the right side of the screen void tableview uitableview tableview willdisplaycell uitableviewcell cell forrowatindexpath nsindexpath indexpath uiswitch s uiswitch [cell contentview viewwithtag switchtag] s center cgpointmake cell contentview center x s center y pushes the switch too far to the right and yet when i log the various coordinates and frames that i m interested in i get what i expect the x coordinate of the center of the cell is at which is half the screen width i ve also tried recalculating the whole frame of the uiswitch but since i m using the values of the screen the switch ends up in the same spot too far to the right what i think is going on is that there is some transform after tableview willdisplaycell forrowatindexpath is called or before even to render all those numbers meaningless it almost looks as if the whole cell gets shifted by x where x is the distance from the edge of the screen to the start of the cell even though that s not reported in the frame property i m sure there s something trivial i m missing please enlighten me,objective-c
i m converting an iphone app to the ipad and i m getting stuck on an innocent and seemingly trivial positioning problem i have a uitableviewcell that only contains a uiswitch centered in the cell the cell is in a uitableview with the grouped style on the iphone i merely set the center property of the switch to the center of the cell in tableview cellforrowatindexpath that suffices for the iphone because the center of the cell is the same regardless of table style this happens on the iphone too but the offset is smaller so the difference is more subtle here s the code for the iphone cell [[[uitableviewcell alloc] initwithstyle uitableviewcellstyledefault reuseidentifier cellidentifier] autorelease] uiswitch view [[uiswitch alloc] initwithframe cgrectzero] [view addtarget self action selector switchchanged forcontrolevents uicontroleventvaluechanged] view center cell center cell selectionstyle uitableviewcellselectionstylenone [cell contentview addsubview view] [view release] on the ipad without changing any code the position of the switch is still in the center of an iphone screen that s bad since ipad is way bigger than iphone so i moved the positioning code to tableview willdisplaycell forrowatindexpath and now it s on the other side of center closer to the right side of the screen void tableview uitableview tableview willdisplaycell uitableviewcell cell forrowatindexpath nsindexpath indexpath uiswitch s uiswitch [cell contentview viewwithtag switchtag] s center cgpointmake cell contentview center x s center y pushes the switch too far to the right and yet when i log the various coordinates and frames that i m interested in i get what i expect the x coordinate of the center of the cell is at which is half the screen width i ve also tried recalculating the whole frame of the uiswitch but since i m using the values of the screen the switch ends up in the same spot too far to the right what i think is going on is that there is some transform after tableview willdisplaycell forrowatindexpath is called or before even to render all those numbers meaningless it almost looks as if the whole cell gets shifted by x where x is the distance from the edge of the screen to the start of the cell even though that s not reported in the frame property i m sure there s something trivial i m missing please enlighten me,cocoa-touch
i m converting an iphone app to the ipad and i m getting stuck on an innocent and seemingly trivial positioning problem i have a uitableviewcell that only contains a uiswitch centered in the cell the cell is in a uitableview with the grouped style on the iphone i merely set the center property of the switch to the center of the cell in tableview cellforrowatindexpath that suffices for the iphone because the center of the cell is the same regardless of table style this happens on the iphone too but the offset is smaller so the difference is more subtle here s the code for the iphone cell [[[uitableviewcell alloc] initwithstyle uitableviewcellstyledefault reuseidentifier cellidentifier] autorelease] uiswitch view [[uiswitch alloc] initwithframe cgrectzero] [view addtarget self action selector switchchanged forcontrolevents uicontroleventvaluechanged] view center cell center cell selectionstyle uitableviewcellselectionstylenone [cell contentview addsubview view] [view release] on the ipad without changing any code the position of the switch is still in the center of an iphone screen that s bad since ipad is way bigger than iphone so i moved the positioning code to tableview willdisplaycell forrowatindexpath and now it s on the other side of center closer to the right side of the screen void tableview uitableview tableview willdisplaycell uitableviewcell cell forrowatindexpath nsindexpath indexpath uiswitch s uiswitch [cell contentview viewwithtag switchtag] s center cgpointmake cell contentview center x s center y pushes the switch too far to the right and yet when i log the various coordinates and frames that i m interested in i get what i expect the x coordinate of the center of the cell is at which is half the screen width i ve also tried recalculating the whole frame of the uiswitch but since i m using the values of the screen the switch ends up in the same spot too far to the right what i think is going on is that there is some transform after tableview willdisplaycell forrowatindexpath is called or before even to render all those numbers meaningless it almost looks as if the whole cell gets shifted by x where x is the distance from the edge of the screen to the start of the cell even though that s not reported in the frame property i m sure there s something trivial i m missing please enlighten me,ipad
i m converting an iphone app to the ipad and i m getting stuck on an innocent and seemingly trivial positioning problem i have a uitableviewcell that only contains a uiswitch centered in the cell the cell is in a uitableview with the grouped style on the iphone i merely set the center property of the switch to the center of the cell in tableview cellforrowatindexpath that suffices for the iphone because the center of the cell is the same regardless of table style this happens on the iphone too but the offset is smaller so the difference is more subtle here s the code for the iphone cell [[[uitableviewcell alloc] initwithstyle uitableviewcellstyledefault reuseidentifier cellidentifier] autorelease] uiswitch view [[uiswitch alloc] initwithframe cgrectzero] [view addtarget self action selector switchchanged forcontrolevents uicontroleventvaluechanged] view center cell center cell selectionstyle uitableviewcellselectionstylenone [cell contentview addsubview view] [view release] on the ipad without changing any code the position of the switch is still in the center of an iphone screen that s bad since ipad is way bigger than iphone so i moved the positioning code to tableview willdisplaycell forrowatindexpath and now it s on the other side of center closer to the right side of the screen void tableview uitableview tableview willdisplaycell uitableviewcell cell forrowatindexpath nsindexpath indexpath uiswitch s uiswitch [cell contentview viewwithtag switchtag] s center cgpointmake cell contentview center x s center y pushes the switch too far to the right and yet when i log the various coordinates and frames that i m interested in i get what i expect the x coordinate of the center of the cell is at which is half the screen width i ve also tried recalculating the whole frame of the uiswitch but since i m using the values of the screen the switch ends up in the same spot too far to the right what i think is going on is that there is some transform after tableview willdisplaycell forrowatindexpath is called or before even to render all those numbers meaningless it almost looks as if the whole cell gets shifted by x where x is the distance from the edge of the screen to the start of the cell even though that s not reported in the frame property i m sure there s something trivial i m missing please enlighten me,uitableview


# Step 4. Machine Learning Model Training and Evalution

4.1 Feature transformation is performed on the preprocessed text.

4.2: The labels (Tags) are encoded using StringIndexer.

4.3: A Logistic Regression model is trained on the features and labels and the model is evaluated using accuracy and ROC-AUC scores.



4.1 Feature transformation.
- Tokenization: The text is tokenized into words.
- Stopword Removal: Common stopwords are removed.
- CountVectorizer: The tokenized words are transformed into a vector of term frequencies.
- TF-IDF Vectorization: The term frequencies are transformed into TF-IDF features.

In [0]:

# Tokenization
from pyspark.ml.feature import Tokenizer
from pyspark.ml.feature import StopWordsRemover
from pyspark.ml.feature import CountVectorizer
from pyspark.ml.feature import HashingTF, IDF
from pyspark.ml.feature import StringIndexer


tokenizer = Tokenizer(inputCol= "text", outputCol="tokens")
tokenized = tokenizer.transform(cleaned)

stopword_remover = StopWordsRemover(inputCol="tokens", outputCol="filtered")
stopword = stopword_remover.transform(tokenized)

cv = CountVectorizer(vocabSize=2**16, inputCol="filtered", outputCol='cv')
cv_model = cv.fit(stopword)
text_cv = cv_model.transform(stopword)

idf = IDF(inputCol='cv', outputCol="features", minDocFreq=5) #minDocFreq: remove sparse terms
idf_model = idf.fit(text_cv)
text_idf = idf_model.transform(text_cv)


label_encoder = StringIndexer(inputCol = "tags", outputCol = "label")
le_model = label_encoder.fit(text_idf)
final = le_model.transform(text_idf)

#display(final)

4.3 Model Training and Evalution.

In [0]:
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Model Training
lr = LogisticRegression(maxIter=100)
lr_model = lr.fit(final)
predictions = lr_model.transform(final)
#display(predictions)

# Model Evalution
evaluator = MulticlassClassificationEvaluator(predictionCol="prediction")
roc_auc = evaluator.evaluate(predictions)
accuracy = predictions.filter(predictions.label == predictions.prediction).count() / float(predictions.count())
#print("Model Accuracy: {:.2f}%".format(accuracy * 100))
#print("ROC-AUC: {:.2f}%".format(roc_auc * 100))


# display(predictions)


Downloading artifacts:   0%|          | 0/9 [00:00<?, ?it/s]

Uploading artifacts:   0%|          | 0/4 [00:00<?, ?it/s]

# Step 5. Create a Pipeline
A pipeline is created to automate the preprocessing and modeling steps. The data is split into train and test sets, and the pipeline is fitted and used to make predictions on the test set.

In [0]:
# Importing all the libraries
from pyspark.sql.functions import split, translate, trim, explode, regexp_replace, col, lower
from pyspark.ml.feature import Tokenizer, StopWordsRemover, HashingTF, IDF, StringIndexer
from pyspark.ml.classification import LogisticRegression
from pyspark.ml import Pipeline
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Preparing the data
# Step 1: Creating the joined table
df = posts.join(postType, posts.PostTypeId == postType.Id, "inner")
#joined_df = posts.join(postType, posts.PostTypeId == postType.Id, "inner")

# Step 2: Selecting only Question posts
df = df.filter(col("Type") == "Question")
# Step 3: Formatting the raw data
df = (df.withColumn('Body', regexp_replace(df.Body, r'<.*?>', ''))
      .withColumn("Tags", split(trim(translate(col("Tags"), "<>", " ")), " "))
)
# Step 4: Selecting the columns
df = df.select(col("Body").alias("text"), col("Tags"))
# Step 5: Getting the tags
df = df.select("text", explode("Tags").alias("tags"))
# Step 6: Clean the text
cleaned = df.withColumn('text', regexp_replace('text', r"http\S+", "")) \
                    .withColumn('text', regexp_replace('text', r"[^a-zA-z]", " ")) \
                    .withColumn('text', regexp_replace('text', r"\s+", " ")) \
                    .withColumn('text', lower('text')) \
                    .withColumn('text', trim('text')) 

# Machine Learning
# Step 1: Train Test Split
train, test = cleaned.randomSplit([0.9, 0.1], seed=20200819)
# Step 2: Initializing the transfomers
tokenizer = Tokenizer(inputCol="text", outputCol="tokens")
stopword_remover = StopWordsRemover(inputCol="tokens", outputCol="filtered")
cv = CountVectorizer(vocabSize=2**16, inputCol="filtered", outputCol='cv')
idf = IDF(inputCol='cv', outputCol="features", minDocFreq=5)
label_encoder = StringIndexer(inputCol = "tags", outputCol = "label")
lr = LogisticRegression(maxIter=100)
# Step 3: Creating the pipeline
pipeline = Pipeline(stages=[tokenizer, stopword_remover, cv, idf, label_encoder, lr])
# Step 4: Fitting and transforming (predicting) using the pipeline
pipeline_model = pipeline.fit(train)
predictions = pipeline_model.transform(test)


Downloading artifacts:   0%|          | 0/34 [00:00<?, ?it/s]

Uploading artifacts:   0%|          | 0/4 [00:00<?, ?it/s]

# Step 6. Save the Model file to Azure storage
The trained pipeline model and the StringIndexer model are saved to Azure storage.

In [0]:
# Saving model object to the /mnt/deBDProject directory. Yours name may be different.
pipeline_model.save('/mnt/deBDProject/model')

# Save the the String Indexer to decode the encoding. We need it in the future Sentiment Analysis.
le_model.save('/mnt/deBDProject/stringindexer')

# Review the directory
display(dbutils.fs.ls("/mnt/deBDProject/model"))

path,name,size,modificationTime
dbfs:/mnt/deBDProject/model/metadata/,metadata/,0,1715758357000
dbfs:/mnt/deBDProject/model/stages/,stages/,0,1715758357000
