# INF6032 Big Data Analytics
### 240175267
<br>
This notebook contains the PySpark implementations for the assessment questions and extensions.

### Initial Setup:
- Import necessary libraries
- Load **large** and **MAGPIE_unfiltered** DataFrames from **large.csv.gz** and **MAGPIE_unfiltered.jsonl**
- Inspect data

In [0]:
from pyspark.sql import functions as F
from pyspark.ml.feature import NGram
from pyspark.ml.feature import StopWordsRemover
from pyspark.sql.window import Window

In [0]:
large = spark.read.csv("/FileStore/tables/large.csv.gz", header = True)

MAGPIE_unfiltered = spark.read.json("/FileStore/tables/MAGPIE_unfiltered.jsonl")

In [0]:
large.printSchema()

root
 |-- sentence: string (nullable = true)
 |-- source: string (nullable = true)



In [0]:
display(large.limit(10))

sentence,source
"""The specific epithet """"seemannii"""" refers to someone with the surname 'Seemann","' in many cases it's botanist Berthold Carl Seemann (1825–1871)."""
Adult and pediatric nurse practitioner programs began in 1971.,pages_articles24
"""He received the """"Naim Frashëri"""" award from the Albanian presidency.""",pages_articles24
He competed for Germany in the 2018 Winter Olympics.,pages_articles24
"Despite an increase in the number of votes, Fan failed to win the re-election.",pages_articles24
"After the match, Cabana challenged Aries to a steel cage match for the title at Third Anniversary Celebration.",pages_articles24
Alexander Moissi (; ; 2 April 1879 – 22 March 1935) was an Austrian stage actor (and occasional film actor) of Albanian origin.,pages_articles24
His brother is the biomathematician Wolfgang Alt.,pages_articles24
"The frontage, refurbished in 2017, displays the carved door crowned with a curved fan light and several cartouches, rosettes and other motifs in the Art Nouveau fashion.",pages_articles24
"Tanaka was born in New York City, U.S. on 23 November 1986.",pages_articles24


In [0]:
MAGPIE_unfiltered.printSchema()

root
 |-- confidence: double (nullable = true)
 |-- context: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- document_id: string (nullable = true)
 |-- genre: string (nullable = true)
 |-- id: long (nullable = true)
 |-- idiom: string (nullable = true)
 |-- judgment_count: long (nullable = true)
 |-- label: string (nullable = true)
 |-- label_distribution: struct (nullable = true)
 |    |-- ?: double (nullable = true)
 |    |-- f: double (nullable = true)
 |    |-- i: double (nullable = true)
 |    |-- l: double (nullable = true)
 |    |-- o: double (nullable = true)
 |-- non_standard_usage_explanations: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- offsets: array (nullable = true)
 |    |-- element: array (containsNull = true)
 |    |    |-- element: long (containsNull = true)
 |-- sentence_no: string (nullable = true)
 |-- split: string (nullable = true)
 |-- variant_type: string (nullable = true)



In [0]:
display(MAGPIE_unfiltered.limit(10))

confidence,context,document_id,genre,id,idiom,judgment_count,label,label_distribution,non_standard_usage_explanations,offsets,sentence_no,split,variant_type
1.0,"List(, , One can not come to terms with the past , one can not find peace or reconciliation unless one faces up to history in its entirety ., , )",p39d1118,PMB,0,come to terms with,3,i,"List(0.0, 0.0, 1.0, 0.0, 0.0)",List(),"List(List(12, 16), List(17, 19), List(20, 25), List(26, 30))",0,training,identical
1.0,"List(And there may be one or two other things we work through over this meeting and the next meeting ., We are close to the end of erm You should be lively signing this document ., We might call it a day at the end of the assignments and take Jenny 's away to recap , erm lets get to the end of the assignments first , Q P9 ., Where is Q P9 ? papers rustling This one I think we should give a little thought , because this is the one I think where oh yes this is going to get complicated because I have now been given a quiff and a few suggestion forms about it, You said you did n't have the print of it , is that right ?)",J97,S meeting,1,call it a day,3,i,"List(0.0, 0.0, 1.0, 0.0, 0.0)",List(),"List(List(9, 13), List(14, 16), List(19, 22))",470,training,identical
1.0,"List(It was the first thing I asked.’, Lindsey nodded ., ‘ Well , it 's a recognised symptom of the condition that an attack can come out of the blue ., Let 's go and have a chat with him and see if we can put his mind at rest.’, Jill followed her to the ward where Mr Deakin was sitting in a chair beside the bed .)",JXW,W fict prose,2,out of the blue,3,i,"List(0.0, 0.0, 1.0, 0.0, 0.0)",List(),"List(List(77, 80), List(81, 83), List(88, 92))",1202,training,identical
1.0,"List(When I took these two guitars into the studio to try them out , over the course of the first hour everyone and their cat called in and could n't resist having a go ., Interestingly , these instruments complement each other very well tonally , and as we muddled through everything from new age to rock to the Isley Brothers greatest hits , the EGs seemed the ideal instruments for every job ., Taking the EG-1 first , the HFS bridge humbucker delivers a warm , fat sound reminiscent of a vintage PAF , and while the guitar does n't have the image of a rocker the bridge pickup is almost arrogant in the way it handles an amp at full tilt ., With the same amp settings the low hum single coils sing with almost the character of a Gibson P90 ., With a clean tone they 're warm and solid - sounding , offering just the right amount of signal to cause the amp to get gritty when the guitar 's volume pot is whacked full up .)",C9M,W pop lore,3,full tilt,5,i,"List(0.0, 0.0, 1.0, 0.0, 0.0)",List(),"List(List(234, 238), List(239, 243))",1117,training,identical
1.0,"List(Now , generally speaking wi I I I 've act what I 've done for the re - sit paper is that I I pooled a load of questions , some of which went to the first paper and some of which went to the ne and some of which went to the re - sit paper so I do n't know which topics are coming up erm on the re - sit paper ., However , they will be drawn from the same list of topics that 're on the first and there may be some overlap ., Generally speaking , re - sit paper erm will contain at least one question very similar to a question on the first paper ., Er I , does anybody know where the past papers are kept ?, Yeah , in the library , that 's where I 'd usually go and look for them)",JT1,S lect soc science,4,on paper,3,l,"List(0.0, 0.0, 0.0, 1.0, 0.0)",List(),"List(List(102, 104), List(115, 120))",291,training,insertion-other
0.7205235271756356,"List(LETTERS, Cropspray tests, We refer to D. R. Goldsmith 's letter on ' Human cropspray tests'(Letters , 10 March p 677 ) ., The publication by Rao et al referred to by D. R. Goldsmith in his letter , is based on reports by Ciba - Geigy of India Ltd., These were submitted to the Central Insecticides Board , Faridabad , upon the authority 's request .)",B7C,W nonAc: nat science,5,to the letter,10,l,"List(0.0, 0.27947647282436444, 0.0, 0.7205235271756356, 0.0)",List(),"List(List(9, 11), List(31, 37))",2076,training,combined-other
1.0,"List(Perhaps I am becoming too gullible in my old age ., I believed that he might bring forward such an amendment , but he has not done so ., The Prime Minister believes that her words strike a chord with the British public ., I shall not allow myself to be tempted to discuss devolution , Mr. Deputy Speaker , apart from getting into trouble with you , I should get into trouble with my Hon . Friends ., Let me , however , refer briefly to the speech in which the Prime Minister said that she was totally against devolution .)",G3H,W hansard,6,strike a chord,3,i,"List(0.0, 0.0, 1.0, 0.0, 0.0)",List(),"List(List(43, 49), List(52, 57))",1682,training,identical
0.729621093086573,"List(Once his immediate knowledge had been used up his value , particularly after the Cyprus spy fiasco , was a fast - diminishing asset ., He is now a forgotten man who will while away his twilight years alone and insecure in an alien land , longing to return to Russia . The KGB has not been slow to remind its senior intelligence personnel of the isolated lifestyle that awaits them in the West should they be tempted to defect , pointing out the benefits of enjoying a relatively luxurious and secure existence in Russia ., It has also sent the West numerous doubtful and bogus defectors to muddy the waters ., One of the earliest was Oleg Penkovsky , a lieutenant - colonel in the GRU , who approached MI6 in 1960 with offers of information after having been twice rebuffed by the CIA ., A businessman called Greville Wynne was asked to act as a freelance MI6 contact and for the next two years Penkovsky provided an incredible wealth of intimate detail about the Russians ' innermost plans including the period of the 1962 Cuban missile crisis .)",AN0,W nonAc: polit law edu,7,muddy the waters,4,i,"List(0.0, 0.0, 0.729621093086573, 0.27037890691342703, 0.0)",List(),"List(List(67, 72), List(77, 83))",956,training,identical
0.6811264106015117,"List(Standing in front , was a knee - high block with a large sharp axe lying across it ., Picking up the axe as though it were no heavier than a bread - knife , Jos indicated the unsplit wood ., ‘ You keep 'em coming,’ he said , ‘ and I 'll do the honours.’, Mungo set to , enjoying the work ., Keeping Jos supplied meant non - stop motion .)",ACV,W fict prose,8,do the honours,3,i,"List(0.0, 0.0, 0.6811264106015117, 0.3188735893984884, 0.0)",List(),"List(List(46, 48), List(53, 62))",1611,training,dashes
1.0,"List(7.4 High - precision radiocarbon calibration curve based on Irish oak ( courtesy of Gordon Pearson ) ., 7.5 Section of Stuiver and Pearson 's high - precision calibration curve for the recent past ( courtesy of the authors ) ., the highest and lowest lines show the error term on the curve ( centre line ) at each point ., Large wiggles mean that a single radiocarbon result can correspond to more than one calendar result ( see fig . 7.8 ) ; distinguishing between the different calendar possibilities can not then be achieved by radiocarbon alone ., 7.6 above left Polished section of the trunk of an oak showing the well - defined rings ( courtesy Jonathan Pilcher ) .)",AC9,W nonAc: humanities arts,9,high and low,3,l,"List(0.0, 0.0, 0.0, 1.0, 0.0)",List(),"List(List(4, 11), List(12, 15), List(16, 22))",635,training,misc


### Question 1.
Calculate the number of different sentences in the dataset.

In [0]:
large_different_sentence_count = large.select(F.countDistinct("sentence")).first()[0]

print(f"\nThe number of different sentences in the large dataset is: {large_different_sentence_count}")


The number of different sentences in the large dataset is: 389639


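### Question 1 - Extension.
Compare how lowercasing, punctuation removal, and both combined change the number of different sentences.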

In [0]:

large_normalised = (large
    .withColumn("sentence_lower", F.lower(F.col("sentence")))
    .withColumn("sentence_no_punct", F.regexp_replace(F.col("sentence"), "[^\\w\\s]+", ""))
    .withColumn("sentence_normalised", F.lower(F.col("sentence_no_punct")))
)

counts = (large_normalised.agg(
    F.countDistinct("sentence_lower").alias("lower_distinct_sentences"),
    F.countDistinct("sentence_no_punct").alias("no_punct_distinct_sentences"),
    F.countDistinct("sentence_normalised").alias("normalised_distinct_sentences")
    ).first()
)

large_different_sentence_count_lower = counts["lower_distinct_sentences"]

large_different_sentence_count_no_punct = counts["no_punct_distinct_sentences"]

large_different_sentence_count_normalised = counts["normalised_distinct_sentences"]

print(f"\nThe effect of normalisation techniques on the number of different sentences in the large dataset:\n")
print(f"The number of different sentences (without normalisation) is: {large_different_sentence_count}")
print(f"The number of different sentences (case-insensitive) is: {large_different_sentence_count_lower}")
print(f"The number of different sentences (without punctuation) is: {large_different_sentence_count_no_punct}")
print(f"The number of different sentences (case-insensitive, without punctuation) is: {large_different_sentence_count_normalised}")


The effect of normalisation techniques on the number of different sentences in the large dataset:

The number of different sentences (without normalisation) is: 389639
The number of different sentences (case-insensitive) is: 389595
The number of different sentences (without punctuation) is: 389463
The number of different sentences (case-insensitive, without punctuation) is: 389412
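
The effect of each normalisation step can be illustrated on a handful of sentences in plain Python. This is only a sketch: the sample sentences are made up (they are not taken from **large.csv.gz**), and Python's `re` module is assumed to treat the pattern `[^\w\s]+` the same way Spark does for these ASCII inputs.

```python
import re

# Hypothetical sample sentences (not taken from large.csv.gz).
sentences = [
    "He was born in Paris.",
    "he was born in Paris.",   # differs only by case
    "He was born in Paris",    # differs only by punctuation
    "He was born, in Paris.",  # differs only by punctuation
    "She was born in Rome.",
]

def strip_punct(text):
    # Same pattern as the notebook: drop anything that is not a word character or whitespace.
    return re.sub(r"[^\w\s]+", "", text)

distinct_raw = len(set(sentences))
distinct_lower = len({s.lower() for s in sentences})
distinct_no_punct = len({strip_punct(s) for s in sentences})
distinct_normalised = len({strip_punct(s).lower() for s in sentences})

print(distinct_raw, distinct_lower, distinct_no_punct, distinct_normalised)
# prints: 5 4 3 2
```

Each step can only merge sentences, never split them, so the counts cannot increase from left to right, which matches the pattern in the output above.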


### Question 2.
List the numbers of words in the 10 longest sentences.

In [0]:
large_words = large.withColumn("words", F.split(F.col("sentence"), "\\s+"))

large_word_counts = large_words.withColumn("word_count", F.size(F.col("words"))).orderBy(F.desc("word_count"))

large_10_longest = large_word_counts.select(F.col("word_count").alias("Word Count")).limit(10)

print("\nWord counts of the 10 longest sentences in the large dataset:\n")
large_10_longest.display()


Word counts of the 10 longest sentences in the large dataset:



Word Count
4571
2499
562
528
426
413
382
381
348
335
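
The word counts above depend on how a sentence is tokenised. The plain-Python sketch below mirrors the `\s+` split on made-up inputs; it assumes that Spark's `split` keeps empty leading and trailing tokens in the same way as Python's `re.split`, which is worth confirming on the cluster. `word_count_trimmed` is a hypothetical variant, not something the notebook uses.

```python
import re

def word_count(sentence):
    # Mirrors F.size(F.split(sentence, "\\s+")): empty tokens are counted too.
    return len(re.split(r"\s+", sentence))

def word_count_trimmed(sentence):
    # Hypothetical variant: str.split() with no arguments ignores leading/trailing whitespace.
    return len(sentence.split())

examples = ["He competed for Germany.", " He competed for Germany. ", ""]
for text in examples:
    print(repr(text), word_count(text), word_count_trimmed(text))
# prints:
# 'He competed for Germany.' 4 4
# ' He competed for Germany. ' 6 4
# '' 1 0
```

Under this assumption, surrounding whitespace inflates a count by one token per side, and an empty string still counts as one (empty) word.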


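### Question 2 - Extension.
Summary statistics of sentence length (in words) for the whole dataset and for the 10 longest sentences.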

In [0]:
large_words = large.withColumn("words", F.split(F.col("sentence"), "\\s+"))

large_word_counts = large_words.withColumn("word_count", F.size("words")).orderBy(F.desc("word_count"))

large_overall_stats = (large_word_counts.select(
    F.mean("word_count").alias("mean_word_count_overall"),
    F.stddev("word_count").alias("stddev_word_count_overall"),
    F.max("word_count").alias("max_word_count_overall"),
    F.min("word_count").alias("min_word_count_overall")
    ).first()
)

large_10_longest = large_word_counts.select("word_count").limit(10)

large_10_longest_stats = (large_10_longest.select(
    F.mean("word_count").alias("mean_word_count_10_longest"),
    F.stddev("word_count").alias("stddev_word_count_10_longest"),
    F.max("word_count").alias("max_word_count_10_longest"),
    F.min("word_count").alias("min_word_count_10_longest")
    ).first()
)

print("\nOverall sentence statistics for the large dataset:\n")
print(f"Mean word count: {large_overall_stats['mean_word_count_overall']:.2f}")
print(f"Standard deviation of word count: {large_overall_stats['stddev_word_count_overall']:.2f}")
print(f"Maximum word count: {large_overall_stats['max_word_count_overall']}")
print(f"Minimum word count: {large_overall_stats['min_word_count_overall']}")

print("\n\n10 longest sentences statistics for the large dataset:\n")
print(f"Mean word count: {large_10_longest_stats['mean_word_count_10_longest']:.2f}")
print(f"Standard deviation of word count: {large_10_longest_stats['stddev_word_count_10_longest']:.2f}")
print(f"Maximum word count: {large_10_longest_stats['max_word_count_10_longest']}")
print(f"Minimum word count: {large_10_longest_stats['min_word_count_10_longest']}")


Overall sentence statistics for the large dataset:

Mean word count: 19.04
Standard deviation of word count: 13.27
Maximum word count: 4571
Minimum word count: 1


10 longest sentences statistics for the large dataset:

Mean word count: 1044.50
Standard deviation of word count: 1402.39
Maximum word count: 4571
Minimum word count: 335
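
The statistics for the 10 longest sentences can be checked by hand from the Question 2 output. The sketch below uses Python's `statistics` module; `statistics.stdev` is the sample standard deviation (dividing by n - 1), which is also what `F.stddev` computes.

```python
import statistics

# Word counts of the 10 longest sentences, copied from the Question 2 output.
top_10_word_counts = [4571, 2499, 562, 528, 426, 413, 382, 381, 348, 335]

mean_top_10 = statistics.mean(top_10_word_counts)
# Sample standard deviation (divides by n - 1).
stddev_top_10 = statistics.stdev(top_10_word_counts)

print(f"Mean word count: {mean_top_10:.2f}")
print(f"Standard deviation of word count: {stddev_top_10:.2f}")
# prints:
# Mean word count: 1044.50
# Standard deviation of word count: 1402.39
```

The standard deviation exceeds the mean because the two longest sentences (4571 and 2499 words) lie far from the other eight values.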


### Question 3.
If we define a bigram as a pair of consecutive words, find the average number of bigrams\
per sentence across the dataset (Note that the **pyspark.ml.feature** module contains\
a function which can help you find bigrams.)

In [0]:
large_words = large.withColumn("words", F.split(F.col("sentence"), "\\s+"))

large_bigram_transformer = NGram(n = 2, inputCol = "words", outputCol = "bigrams")
large_bigrams = large_bigram_transformer.transform(large_words)

large_average_bigrams = large_bigrams.select(F.avg(F.size(F.col("bigrams")))).first()[0]

print(f"\nThe average number of bigrams per sentence in the large dataset is: {large_average_bigrams}")


The average number of bigrams per sentence in the large dataset is: 18.0365125
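
As a sanity check, a sentence with k tokens contains max(k - n + 1, 0) n-grams, so the average bigram count should sit roughly one below the mean word count (19.04 in the Question 2 extension output). The sketch below demonstrates the relationship on made-up token counts; `ngram_count` is a hypothetical helper, not part of `pyspark.ml.feature`.

```python
def ngram_count(num_tokens, n):
    # A sequence of num_tokens tokens contains num_tokens - n + 1 n-grams, never fewer than zero.
    return max(num_tokens - n + 1, 0)

# Hypothetical token counts for five sentences.
token_counts = [1, 2, 5, 10, 20]

avg_words = sum(token_counts) / len(token_counts)
avg_bigrams = sum(ngram_count(c, 2) for c in token_counts) / len(token_counts)
avg_trigrams = sum(ngram_count(c, 3) for c in token_counts) / len(token_counts)

print(avg_words, avg_bigrams, avg_trigrams)
# prints: 7.6 6.6 5.8
```

The average drops by exactly one from words to bigrams because every sentence here has at least one token. From bigrams to trigrams it drops by only 0.8, because the one-token sentence contributes zero to both.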


### Question 3 - Extension.
Average number of trigrams per sentence.

In [0]:
large_words = large.withColumn("words", F.split(F.col("sentence"), "\\s+"))

large_trigram_transformer = NGram(n = 3, inputCol = "words", outputCol = "trigrams")
large_trigrams = large_trigram_transformer.transform(large_words)

large_average_trigrams = large_trigrams.select(F.avg(F.size(F.col("trigrams")))).first()[0]

print(f"\nThe average number of trigrams per sentence in the large dataset is: {large_average_trigrams}")


The average number of trigrams per sentence in the large dataset is: 17.037275


### Question 4.
Find the 10 most frequent bigrams in the dataset. **Ensure that all 10 answers, along\
with their frequencies, are visible in your report.**

In [0]:
large_words = large.withColumn("words", F.split(F.col("sentence"), "\\s+"))

large_bigram_transformer = NGram(n = 2, inputCol = "words", outputCol = "bigrams")
large_bigrams = large_bigram_transformer.transform(large_words)

large_exploded_bigrams = large_bigrams.select(F.explode(F.col("bigrams")).alias("Bigram"))

large_bigram_counts = large_exploded_bigrams.groupBy("Bigram").count()

large_most_freq_bigrams = (large_bigram_counts
    .orderBy(F.col("count").desc())
    .limit(10)
    .select(F.col("Bigram"), F.col("count").alias("Count"))
)

print("\nThe 10 most frequent bigrams in the large dataset:\n")
large_most_freq_bigrams.display()


The 10 most frequent bigrams in the large dataset:



Bigram,Count
of the,76294
in the,54058
to the,25486
at the,21596
is a,19316
for the,17946
on the,16050
and the,15824
as a,13240
with the,11929
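
The explode, group, and count steps can be mirrored in plain Python on a toy corpus. This is an illustrative sketch with made-up sentences, not the Spark implementation.

```python
from collections import Counter

# Hypothetical sample sentences (not taken from large.csv.gz).
sentences = [
    "He was born in the city of London",
    "She lived in the north of the city",
    "The city of York is in the north",
]

def bigrams(sentence):
    words = sentence.split()
    # Pair each word with its successor, joined by a single space.
    return [f"{first} {second}" for first, second in zip(words, words[1:])]

bigram_counts = Counter(b for s in sentences for b in bigrams(s))

# Sort by descending count, then alphabetically so that ties are deterministic.
top_3 = sorted(bigram_counts.items(), key=lambda item: (-item[1], item[0]))[:3]
print(top_3)
# prints: [('in the', 3), ('city of', 2), ('the city', 2)]
```

Matching is case-sensitive, so "The city" and "the city" are counted separately. The secondary alphabetical key is an addition in this sketch: ordering by count alone leaves the order of tied bigrams unspecified.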


### Question 4 - Alternative.
Alternative implementation using RDD.

In [0]:
large_sentence_rdd = large.select("sentence").rdd.map(lambda row: row.sentence)

def sentence_to_bigrams(sentence):
    # Pair each word with its successor; fewer than two words yields an empty list.
    words = sentence.split()
    return [f"{first} {second}" for first, second in zip(words, words[1:])]

large_bigram_rdd = large_sentence_rdd.flatMap(sentence_to_bigrams)

large_bigram_pairs_rdd = large_bigram_rdd.map(lambda bigram: (bigram, 1))

large_bigram_counts_rdd = large_bigram_pairs_rdd.reduceByKey(lambda x, y: x + y)

large_sorted_bigram_counts_rdd = large_bigram_counts_rdd.sortBy(lambda item: item[1], ascending = False)

large_10_most_freq_bigrams = large_sorted_bigram_counts_rdd.take(10)

print("\nThe 10 most frequent bigrams in the large dataset\n(Alternative implementation using RDD):\n")
spark.createDataFrame(large_10_most_freq_bigrams, ["Bigram", "Count"]).display()


The 10 most frequent bigrams in the large dataset
(Alternative implementation using RDD):



Bigram,Count
of the,76294
in the,54058
to the,25486
at the,21596
is a,19316
for the,17946
on the,16050
and the,15824
as a,13240
with the,11929


### Question 4 - Extension 1.
The 10 most frequent bigrams after lowercasing and stop word removal.

In [0]:
large_words_lower = large.withColumn("words", F.split(F.lower(F.col("sentence")), "\\s+"))

stop_word_remover = StopWordsRemover(inputCol = "words", outputCol = "filtered_words")

large_filtered_words_lower = stop_word_remover.transform(large_words_lower)

large_bigram_transformer = NGram(n = 2, inputCol = "filtered_words", outputCol = "filtered_bigrams")
large_filtered_bigrams = large_bigram_transformer.transform(large_filtered_words_lower)

large_exploded_filtered_bigrams = large_filtered_bigrams.select(F.explode(F.col("filtered_bigrams")).alias("Filtered Bigram"))

large_filtered_bigram_counts = large_exploded_filtered_bigrams.groupBy("Filtered Bigram").count()

large_most_freq_filtered_bigrams = (large_filtered_bigram_counts
    .orderBy(F.col("count").desc())
    .limit(10)
    .select(F.col("Filtered Bigram"), F.col("count").alias("Count"))
)

print("\nThe 10 most frequent bigrams in the large dataset after lowercasing and stop word removal:\n")
large_most_freq_filtered_bigrams.display()


The 10 most frequent bigrams in the large dataset after lowercasing and stop word removal:



Filtered Bigram,Count
united states,2591
new york,2014
high school,1386
took place,1218
winter olympics.,1133
refer to:,1046
world war,1037
table tennis,908
may refer,904
early life,829
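
Entries such as `winter olympics.` and `refer to:` show that punctuation stays attached to tokens when splitting on whitespace alone, so the same phrase can be counted under several spellings. The plain-Python sketch below shows how stripping punctuation before tokenising merges them. The sentences and the tiny `STOP_WORDS` set are made up for illustration; the default English list in `StopWordsRemover` is much longer.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not the StopWordsRemover default).
STOP_WORDS = {"the", "in", "at", "he", "she", "for"}

# Hypothetical sample sentences (not taken from large.csv.gz).
sentences = [
    "He competed in the 2018 Winter Olympics.",
    "She competed at the Winter Olympics in 2018.",
]

def filtered_bigrams(sentence, strip_punctuation):
    text = sentence.lower()
    if strip_punctuation:
        # Drop anything that is not a word character or whitespace before tokenising.
        text = re.sub(r"[^\w\s]+", "", text)
    words = [w for w in text.split() if w not in STOP_WORDS]
    return [f"{a} {b}" for a, b in zip(words, words[1:])]

with_punct = Counter(b for s in sentences for b in filtered_bigrams(s, False))
without_punct = Counter(b for s in sentences for b in filtered_bigrams(s, True))

print(with_punct["winter olympics"], with_punct["winter olympics."])
print(without_punct["winter olympics"])
# prints:
# 1 1
# 2
```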


### Question 4 - Extension 2.
The 10 most frequent trigrams.

In [0]:
large_words = large.withColumn("words", F.split(F.col("sentence"), "\\s+"))

large_trigram_transformer = NGram(n = 3, inputCol = "words", outputCol = "trigrams")
large_trigrams = large_trigram_transformer.transform(large_words)

large_exploded_trigrams = large_trigrams.select(F.explode(F.col("trigrams")).alias("Trigram"))

large_trigram_counts = large_exploded_trigrams.groupBy("Trigram").count()

large_most_freq_trigrams = (large_trigram_counts
    .orderBy(F.col("count").desc())
    .limit(10)
    .select(F.col("Trigram"), F.col("count").alias("Count"))
)

print("\nThe 10 most frequent trigrams in the large dataset:\n")
large_most_freq_trigrams.display()


The 10 most frequent trigrams in the large dataset:



Trigram,Count
one of the,3307
the University of,3096
as well as,2991
member of the,2768
part of the,2668
a member of,2393
was born in,2157
the end of,1933
the United States,1875
at the University,1494


### Question 4 - Extension 3.
The 10 most frequent trigrams after lowercasing and stop word removal.

In [0]:
large_words_lower = large.withColumn("words", F.split(F.lower(F.col("sentence")), "\\s+"))

stop_word_remover = StopWordsRemover(inputCol = "words", outputCol = "filtered_words")

large_filtered_words_lower = stop_word_remover.transform(large_words_lower)

large_trigram_transformer = NGram(n = 3, inputCol = "filtered_words", outputCol = "filtered_trigrams")
large_filtered_trigrams = large_trigram_transformer.transform(large_filtered_words_lower)

large_exploded_filtered_trigrams = large_filtered_trigrams.select(F.explode(F.col("filtered_trigrams")).alias("Filtered Trigram"))

large_filtered_trigram_counts = large_exploded_filtered_trigrams.groupBy("Filtered Trigram").count()

large_most_freq_filtered_trigrams = (large_filtered_trigram_counts
    .orderBy(F.col("count").desc())
    .limit(10)
    .select(F.col("Filtered Trigram"), F.col("count").alias("Count"))
)

print("\nThe 10 most frequent trigrams in the large dataset after lowercasing and stop word removal:\n")
large_most_freq_filtered_trigrams.display()


The 10 most frequent trigrams in the large dataset after lowercasing and stop word removal:



Filtered Trigram,Count
may refer to:,824
early life education.,604
world table tennis,582
table tennis championships,533
national register historic,482
register historic places,451
professional footballer plays,439
listed national register,391
new york city,340
people surname include:,326


### Question 5.
Since bigrams don’t have to make sense, it’d be useful to gain a bit more information\
about those that do. The MAGPIE dataset contains idioms, which are phrases or\
expressions that have figurative meaning that’s different from the literal definition of\
the words. Find out how many of the bigrams you’ve extracted from the Wikipedia\
subset appear in the list of idioms contained in the MAGPIE subset.

In [0]:
large_words = large.withColumn("words", F.split(F.col("sentence"), "\\s+"))

large_bigram_transformer = NGram(n = 2, inputCol = "words", outputCol = "bigrams")
large_bigrams = large_bigram_transformer.transform(large_words)

large_exploded_bigrams = (large_bigrams.select(F.explode(F.col("bigrams")).alias("bigram")).distinct())

MAGPIE_unfiltered_idioms = MAGPIE_unfiltered.select("idiom").distinct()

large_matching = large_exploded_bigrams.join(MAGPIE_unfiltered_idioms, F.col("bigram") == F.col("idiom"), "inner")

count = large_matching.count()

print(f"\nNumber of bigrams in the large dataset matching MAGPIE_unfiltered idioms: {count}")


Number of bigrams in the large dataset matching MAGPIE_unfiltered idioms: 67
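
Because both sides of the join are distinct, the inner join on equality behaves like a set intersection. The sketch below illustrates this in plain Python with made-up values; it also shows that only idioms of exactly two whitespace-separated words can ever equal a bigram, and that matching is case-sensitive.

```python
# Hypothetical distinct bigrams and idioms (illustrative values only).
distinct_bigrams = {"on board", "On board", "in the", "hot air", "of the"}
distinct_idioms = {"on board", "hot air", "out of the blue", "call it a day"}

# An inner join on equality between two distinct sets is a set intersection.
matches = distinct_bigrams & distinct_idioms
print(sorted(matches))
# prints: ['hot air', 'on board']

# Only idioms made of exactly two whitespace-separated words can ever equal a bigram.
two_word_idioms = {idiom for idiom in distinct_idioms if len(idiom.split()) == 2}
print(sorted(two_word_idioms))
# prints: ['hot air', 'on board']
```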


### Question 5 - Alternative.
Alternative implementation using PySpark SQL.

In [0]:
large_bigram_counts.createOrReplaceTempView("large_bigram_counts_view")

MAGPIE_unfiltered.createOrReplaceTempView("magpie_unfiltered_view")

large_matching_sql_query = """
SELECT
    COUNT(table_1.bigram) as matching_count
FROM
    (SELECT bigram FROM large_bigram_counts_view) AS table_1
INNER JOIN
    (SELECT DISTINCT idiom FROM magpie_unfiltered_view) AS table_2
ON
    table_1.bigram = table_2.idiom
"""

large_matching_sql = spark.sql(large_matching_sql_query)

# An aggregate query without GROUP BY always returns exactly one row,
# so the count can be read directly (it is 0 when nothing matches).
count = large_matching_sql.first()["matching_count"]

print(f"\nNumber of bigrams in the large dataset matching MAGPIE_unfiltered idioms\n(Alternative implementation using PySpark SQL): {count}")


Number of bigrams in the large dataset matching MAGPIE_unfiltered idioms
(Alternative implementation using PySpark SQL): 67
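
A `COUNT` aggregate without `GROUP BY` returns a single row even when the join matches nothing, so reading `first()["matching_count"]` is safe. The sketch below demonstrates this with Python's built-in `sqlite3` module rather than Spark SQL; the table names and rows are made up, and it assumes Spark follows the same standard SQL behaviour.

```python
import sqlite3

# In-memory SQLite database used purely for illustration; table names are hypothetical.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE bigrams (bigram TEXT)")
connection.execute("CREATE TABLE idioms (idiom TEXT)")
connection.executemany("INSERT INTO bigrams VALUES (?)", [("of the",), ("in the",)])
connection.executemany("INSERT INTO idioms VALUES (?)", [("on board",), ("hot air",)])

rows = connection.execute(
    """
    SELECT COUNT(b.bigram) AS matching_count
    FROM bigrams AS b
    INNER JOIN (SELECT DISTINCT idiom FROM idioms) AS i
    ON b.bigram = i.idiom
    """
).fetchall()
connection.close()

# The join matches nothing, yet the aggregate still returns a single row holding 0.
print(rows)
# prints: [(0,)]
```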


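### Question 5 - Extension.
The 10 most frequent bigrams that also appear as MAGPIE idioms, with their frequencies in the large dataset.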

In [0]:
large_matching = large_exploded_bigrams.join(MAGPIE_unfiltered_idioms, F.col("bigram") == F.col("idiom"), "inner")

distinct_matching_bigrams = large_matching.select("bigram")

matching_bigrams_with_freq = distinct_matching_bigrams.join(large_bigram_counts, "bigram", "inner")

_10_most_frequent_matches = (matching_bigrams_with_freq
    .orderBy(F.desc("count"))
    .limit(10)
    .select(F.col("bigram").alias("Bigram"), F.col("count").alias("Count"))
)

print(f"\nThe 10 most frequent bigrams common to both the large dataset and MAGPIE_unfiltered (with frequencies from the large dataset):\n")
_10_most_frequent_matches.display()


The 10 most frequent bigrams common to both the large dataset and MAGPIE_unfiltered (with frequencies from the large dataset):



Bigram,Count
on board,82
game on,74
in business,53
spot on,40
at sea,20
for Africa,15
on paper,10
in tandem,10
hot air,9
lone wolf,8


### Question 6.
Ensuring that you are only considering the bigrams that appear in Wikipedia and not in\
MAGPIE, print out the 10 bigrams starting from rank 2500 when these are ordered by\
decreasing frequency (ensure alphabetical order for same frequency bigrams) i.e. your\
output should start with the bigram ranked 2500, and finish with the bigram ranked\
2510.

In [0]:
large_words = large.withColumn("words", F.split(F.col("sentence"), "\\s+"))

large_bigram_transformer = NGram(n = 2, inputCol = "words", outputCol = "bigrams")
large_bigrams = large_bigram_transformer.transform(large_words)

large_exploded_bigrams = large_bigrams.select(F.explode(F.col("bigrams")).alias("bigram"))

large_bigram_counts = large_exploded_bigrams.groupBy("bigram").count()

MAGPIE_unfiltered_idioms = MAGPIE_unfiltered.select("idiom").distinct()

large_only_bigrams = large_bigram_counts.join(MAGPIE_unfiltered_idioms, F.col("bigram") == F.col("idiom"), "left_anti")

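# The window ordering (count descending, then bigram ascending) makes the ranks
# deterministic, so no separate orderBy is needed beforehand. A window without
# partitionBy moves all rows to a single partition.
rank_window = Window.orderBy(F.col("count").desc(), F.col("bigram"))

ranked_bigrams_with_rank = large_only_bigrams.withColumn("rank", F.row_number().over(rank_window))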

large_ranked_bigrams_range = (ranked_bigrams_with_rank
    .filter((F.col("rank") >= 2501) & (F.col("rank") <= 2510))
    .select(
        F.col("rank").alias("Rank"),
        F.col("bigram").alias("Bigram"),
        F.col("count").alias("Count")
    )
)

print("\nThe bigrams ranked 2501 - 2510 in the large dataset which do not appear in MAGPIE_unfiltered:\n")
large_ranked_bigrams_range.display()


The bigrams ranked 2501 - 2510 in the large dataset which do not appear in MAGPIE_unfiltered:



Rank,Bigram,Count
2501,was responsible,176
2502,which took,176
2503,working for,176
2504,The show,175
2505,a meeting,175
2506,back in,175
2507,built on,175
2508,featured on,175
2509,first team,175
2510,from which,175
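
The ranking logic (descending count, alphabetical tie-break, 1-based rank) can be sketched in plain Python. The counts below are illustrative values only, and the comparison assumes ordering by character code, under which uppercase letters sort before lowercase ones, as in the output above where "The show" precedes "a meeting".

```python
# Hypothetical bigram counts (illustrative values only).
bigram_counts = {
    "which took": 176,
    "a meeting": 175,
    "The show": 175,
    "was responsible": 176,
    "back in": 175,
}

# Descending count first, then ascending bigram to break ties.
ordered = sorted(bigram_counts.items(), key=lambda item: (-item[1], item[0]))

# row_number() is 1-based, so enumerate from 1 to attach the same ranks.
ranked = [(rank, bigram, count) for rank, (bigram, count) in enumerate(ordered, start=1)]

# Ranks 2 to 4 inclusive correspond to the zero-based slice [1:4].
for row in ranked[1:4]:
    print(row)
# prints:
# (2, 'which took', 176)
# (3, 'The show', 175)
# (4, 'a meeting', 175)
```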


### Question 6 - Extension.
Up to three example sentences for each of the bigrams ranked 2501 - 2510.

In [0]:
large_target_bigrams_list = [row.bigram for row in large_ranked_bigrams_range.select(F.col("bigram")).collect()]

print(f"\nExample sentences containing the bigrams ranked 2501 - 2510 in\nthe large dataset which do not appear in MAGPIE_unfiltered:\n")

for large_target_bigram in large_target_bigrams_list:
    print(f"Example sentences containing the bigram: '{large_target_bigram}'")
    
    # Note: contains() matches substrings, so 'back in' also matches 'back injury'.
    large_example_sentences = large.filter(F.col("sentence").contains(large_target_bigram))

    large_example_sentences.select(F.col("sentence").alias("Sentence")).limit(3).display()


Example sentences containing the bigrams ranked 2501 - 2510 in
the large dataset which do not appear in MAGPIE_unfiltered:

Example sentences containing the bigram: 'was responsible'


Sentence
"However, her brother John, who was responsible for organising her dowry, was slow to do so."
He was responsible for introducing a friend's poetry to Mr. Justice Talfourd (died 1854).
"He was responsible for expending and accounting for several million dollars as he acquired medical supplies and equipment for the Union Army, and distributed them to units throughout the country."


Example sentences containing the bigram: 'which took'


Sentence
The four additional domes were added during renovations which took place between 2008 and 2016.
"He returned back to Divizia A football during his second spell at Petrolul in the 1999–2000 season in which he earned two historical victories against Steaua București, a 5–1 at home and a 4–1 on the Ghencea stadium, also a 4–2 home victory against Mircea Lucescu's Rapid București who were the title holders, afterwards going to coach in the lower leagues for a second spell at Midia, later at Cimentul Fieni, Chindia Târgoviște and for a third spell at Plopeni, retiring after a third spell at Petrolul which took place from July until December 2004."
"The 1987 Tolly Ales English Professional Championship was a professional non-ranking snooker tournament, which took place in February 1987 in Ipswich, England."


Example sentences containing the bigram: 'working for'


Sentence
At least once Isabella wrongly stated the light direction to the artists working for her and she often sent changed her mind about the subjects and compositions.
"Gauri Maulekhi started working for People For Animals in Lucknow in 1995 as a volunteer, where she played a vital role in setting up the first animal shelter in the city."
He is now working for Kumar Mangat.


Example sentences containing the bigram: 'The show'


Sentence
"The show was based on Braly's childhood trips to Bangkok, Thailand."
"The show was officially aired and broadcast online on March 17, 2018 on iQiyi."
"The show was transformed when the writers decided to limit the storytelling, with the exception of the opening scene of the first episode, to the perspective of the eight characters."


Example sentences containing the bigram: 'a meeting'


Sentence
Phil calls a meeting to prepare for a showdown with Wendy.
"""Charles Kingston, who had recently returned from a triumphant visit to the United Kingdom, followed by a meeting with the Federation Commission, where he was elected chairman; cancelled all appointments and with his Commissioner of Crown Lands (L. O'Loughlin) was on the 4.30 pm Broken Hill express, and at Petersburg had a """"special"""" waiting to take them to Port Augusta"
"Fifty-one Democrats filed a lawsuit on December 5, to prevent the Democratic State Central Committee from choosing the special election candidate at a meeting."


Example sentences containing the bigram: 'back in'


Sentence
"A member of Mbabane Swallows squad in 2013, Tchakounte had to go to South Africa for personal reasons in January that year before arriving back in Swaziland in time for the league fixture confronting Malanti Chiefs."
"The 2017 bronze medalist, Russia's Elena Eremina, was unable to compete due to a back injury."
The 2018 season was KA's second season back in top tier football in Iceland following their relegation in 2004.


Example sentences containing the bigram: 'built on'


Sentence
"At the end of the 10th century, a 23 by 17 meters large basilica was built on their instead."
"Later, students travelled to Las Cruces for high school at the segregated Booker T. Washington School, which was built on Solano Street in 1934."
The school is a fully residential co-education school for students from Class IV to XI and is built on a sprawling 35-acre land.


Example sentences containing the bigram: 'featured on'


Sentence
"""""""Work It Out"""" was featured on Swiss German-language Radio SRF 3 as the song of the day on 3 February 2022."""
Her work was also featured on the literature table at the New England Hospital for Women and Children.
"The Club also attracted significant attention from South America in 2016 after it was featured on ESPN Argentina, who sent a film crew to the island to produce a documentary on why it had become so popular with South Americans."


Example sentences containing the bigram: 'first team'


Sentence
"Mishina/Galliamov are the first team to win gold in their Worlds debut since Gordeeva/Grinkov of the Soviet Union in 1986, and the second-youngest pair to win Worlds after Gordeeva/Grinkov."
"Gallagher returned to play with the Atlanta United first team in the middle of the 2020 Season, where he made 16 total appearances and registered 4 goals."
"He was an All-Patriot League first team selection and All-Atlantic Region first team both seasons, and was named All-ECAC first team in 2003."


Example sentences containing the bigram: 'from which'


Sentence
"The relationship between derived nominals and the corresponding verb from which it is derived, is idiosyncratic and highly irregular."
"She reached the university's mandatory retirement barrier in 2010 and took a post in the philosophy and letters department at Rissho University, from which she retired in 2015."
"""Gentilicia of this type were common in Umbria and Picenum, and most of the Sibidieni known from inscriptions seem to have lived at or near Tuficum in Umbria, from which it appears that the Sibidieni may have been of Umbrian origin, although the surname """"Sabinus"""" borne by some of the family suggest that they may have been Sabines."""

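
One of the examples for 'back in' contains "back injury" rather than the bigram itself, because substring matching does not respect word boundaries. A token-based check avoids this; the plain-Python sketch below uses two of the sentences shown above. `contains_bigram` is a hypothetical helper, and an equivalent Spark filter would need to compare against the tokenised bigrams column instead of the raw sentence.

```python
def contains_bigram(sentence, bigram):
    words = sentence.split()
    # Compare against whole-token bigrams rather than raw substrings.
    return bigram in {f"{a} {b}" for a, b in zip(words, words[1:])}

sentences = [
    "The 2017 bronze medalist, Russia's Elena Eremina, was unable to compete due to a back injury.",
    "The 2018 season was KA's second season back in top tier football in Iceland following their relegation in 2004.",
]

for sentence in sentences:
    # First value: substring match; second value: token-based match.
    print("back in" in sentence, contains_bigram(sentence, "back in"))
# prints:
# True False
# True True
```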