In [14]:
input = """As a term, data analytics predominantly refers to an assortment of applications, from basic business
intelligence (BI), reporting and online analytical processing (OLAP) to various forms of advanced
analytics. In that sense, it's similar in nature to business analytics, another umbrella term for
approaches to analyzing data -- with the difference that the latter is oriented to business uses, while
data analytics has a broader focus. The expansive view of the term isn't universal, though: In some
cases, people use data analytics specifically to mean advanced analytics, treating BI as a separate
category. Data analytics initiatives can help businesses increase revenues, improve operational
efficiency, optimize marketing campaigns and customer service efforts, respond more quickly to
emerging market trends and gain a competitive edge over rivals -- all with the ultimate goal of
boosting business performance. Depending on the particular application, the data that's analyzed
can consist of either historical records or new information that has been processed for real-time
analytics uses. In addition, it can come from a mix of internal systems and external data sources. At
a high level, data analytics methodologies include exploratory data analysis (EDA), which aims to find
patterns and relationships in data, and confirmatory data analysis (CDA), which applies statistical
techniques to determine whether hypotheses about a data set are true or false. EDA is often
compared to detective work, while CDA is akin to the work of a judge or jury during a court trial -- a
distinction first drawn by statistician John W. Tukey in his 1977 book Exploratory Data Analysis. Data
analytics can also be separated into quantitative data analysis and qualitative data analysis. The
former involves analysis of numerical data with quantifiable variables that can be compared or
measured statistically. The qualitative approach is more interpretive -- it focuses on understanding
the content of non-numerical data like text, images, audio and video, including common phrases,
themes and points of view"""

input = input.lower()

# Split the paragraph into lines
lines = input.splitlines()

In [31]:
lines

['as a term, data analytics predominantly refers to an assortment of applications, from basic business',
 'intelligence (bi), reporting and online analytical processing (olap) to various forms of advanced',
 "analytics. in that sense, it's similar in nature to business analytics, another umbrella term for",
 'approaches to analyzing data -- with the difference that the latter is oriented to business uses, while',
 "data analytics has a broader focus. the expansive view of the term isn't universal, though: in some",
 'cases, people use data analytics specifically to mean advanced analytics, treating bi as a separate',
 'category. data analytics initiatives can help businesses increase revenues, improve operational',
 'efficiency, optimize marketing campaigns and customer service efforts, respond more quickly to',
 'emerging market trends and gain a competitive edge over rivals -- all with the ultimate goal of',
 "boosting business performance. depending on the particular application, th

a) What is the probability of the word “data” occurring in each line?

In [32]:
import re
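def word_probability(line):
    """
    Estimate the probability of the word 'data' in a line of text as its
    relative frequency: occurrences of 'data' divided by the token count.
    Parameters:
        line (str): The input line of text.
    Returns:
        float: The relative frequency of 'data' in the line.
    """
    # Split on spaces, newlines and commas. A comma followed by a space
    # yields an empty-string token, which is included in word_count.
    words = re.split(' |\n|,', line)
    word_count = len(words)
    if word_count == 0:
        return 0  # Defensive guard; re.split always returns at least one token
    new_count = words.count('data')  # Count exact matches of 'data'
    return new_count / word_count  # Relative frequency of 'data'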

for i, line in enumerate(lines, start=1):
    probability = word_probability(line)
    print(f"Probability of 'data' appearing in line {i}: {probability:.2f}")

Probability of 'data' appearing in line 1: 0.06
Probability of 'data' appearing in line 2: 0.00
Probability of 'data' appearing in line 3: 0.00
Probability of 'data' appearing in line 4: 0.06
Probability of 'data' appearing in line 5: 0.06
Probability of 'data' appearing in line 6: 0.06
Probability of 'data' appearing in line 7: 0.08
Probability of 'data' appearing in line 8: 0.00
Probability of 'data' appearing in line 9: 0.00
Probability of 'data' appearing in line 10: 0.08
Probability of 'data' appearing in line 11: 0.00
Probability of 'data' appearing in line 12: 0.05
Probability of 'data' appearing in line 13: 0.12
Probability of 'data' appearing in line 14: 0.13
Probability of 'data' appearing in line 15: 0.06
Probability of 'data' appearing in line 16: 0.00
Probability of 'data' appearing in line 17: 0.12
Probability of 'data' appearing in line 18: 0.14
Probability of 'data' appearing in line 19: 0.07
Probability of 'data' appearing in line 20: 0.00
Probability of 'data' appeari
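
The values above depend on how the line is tokenized: splitting on `' |\n|,'` leaves empty strings in the token list and keeps punctuation attached to words. A minimal sketch of an alternative, assuming a regex tokenizer that keeps only letters, digits and apostrophes, is shown below. The names `tokenize` and `relative_frequency` are illustrative and not part of the original notebook, and the results will differ slightly from the printed output above.

```python
import re

def tokenize(line):
    """Return lowercase word tokens, dropping punctuation and empty strings."""
    return re.findall(r"[a-z0-9']+", line.lower())

def relative_frequency(line, target="data"):
    """Share of tokens in `line` equal to `target` (0.0 for a line with no tokens)."""
    tokens = tokenize(line)
    if not tokens:
        return 0.0
    return tokens.count(target) / len(tokens)

# First line of the paragraph, as it appears after lowercasing
sample = ("as a term, data analytics predominantly refers to an assortment "
          "of applications, from basic business")
print(f"{relative_frequency(sample):.2f}")  # 1 of 15 tokens -> 0.07
```

With this tokenizer the first line has 15 tokens instead of 17, so the estimate moves from 0.06 to 0.07.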

b) What is the distribution of distinct word counts across all the lines?

In [25]:
import re
splitted_input = re.split(' |\n|,', input)

In [26]:
# FreqDist is the only NLTK class used in this cell
from nltk.probability import FreqDist
import nltk
nltk.download('punkt')

# Calculate frequency distribution
fdist = FreqDist(splitted_input)

# Get the most common words
most_common_words = fdist.most_common(320)

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\oggy0\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [27]:
most_common_words

[('', 23),
 ('data', 18),
 ('to', 11),
 ('the', 11),
 ('a', 10),
 ('of', 10),
 ('analytics', 9),
 ('and', 9),
 ('in', 6),
 ('can', 5),
 ('business', 4),
 ('that', 4),
 ('--', 4),
 ('is', 4),
 ('or', 4),
 ('analysis', 4),
 ('term', 3),
 ('with', 3),
 ('as', 2),
 ('from', 2),
 ('advanced', 2),
 ('for', 2),
 ('while', 2),
 ('has', 2),
 ('view', 2),
 ('more', 2),
 ('on', 2),
 ('it', 2),
 ('exploratory', 2),
 ('which', 2),
 ('compared', 2),
 ('work', 2),
 ('analysis.', 2),
 ('be', 2),
 ('qualitative', 2),
 ('predominantly', 1),
 ('refers', 1),
 ('an', 1),
 ('assortment', 1),
 ('applications', 1),
 ('basic', 1),
 ('intelligence', 1),
 ('(bi)', 1),
 ('reporting', 1),
 ('online', 1),
 ('analytical', 1),
 ('processing', 1),
 ('(olap)', 1),
 ('various', 1),
 ('forms', 1),
 ('analytics.', 1),
 ('sense', 1),
 ("it's", 1),
 ('similar', 1),
 ('nature', 1),
 ('another', 1),
 ('umbrella', 1),
 ('approaches', 1),
 ('analyzing', 1),
 ('difference', 1),
 ('latter', 1),
 ('oriented', 1),
 ('uses', 1),
 ('
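
The question can also be read as asking for the number of distinct words in each line and how those per-line counts are distributed. A minimal sketch of that reading follows, under the assumption that words are extracted with a simple regex; `distinct_word_count` and `sample_lines` are hypothetical names, and the sample lines are shortened illustrations rather than the notebook's `lines` list.

```python
import re
from collections import Counter

def distinct_word_count(line):
    """Number of distinct lowercase word tokens in a line."""
    tokens = re.findall(r"[a-z0-9']+", line.lower())
    return len(set(tokens))

# Illustrative input only; in the notebook this would be `lines`
sample_lines = [
    "data analytics has a broader focus",
    "the data that is analyzed is the data",
    "",
]

counts = [distinct_word_count(line) for line in sample_lines]
distribution = Counter(counts)  # maps distinct-word count -> number of lines

print(counts)                        # [6, 5, 0]
print(sorted(distribution.items()))  # [(0, 1), (5, 1), (6, 1)]
```

Because the regex never produces empty tokens, the `('', 23)` entry seen in the frequency table above would not appear with this approach.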

c) What is the probability of the word “analytics” occurring after the word “data”?

In [29]:
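# Count occurrences of 'data' and how often 'analytics' follows immediately.
# Tokens are matched exactly, so 'analytics.' (with a trailing period) does not count.
count_data = 0
count_analytics = 0
for i in range(len(splitted_input) - 1):
    if splitted_input[i] == 'data':
        count_data += 1
        if splitted_input[i + 1] == 'analytics':
            count_analytics += 1
# Conditional relative frequency; guard against division by zero
probability = count_analytics / count_data if count_data else 0.0
print(f"Probability of 'analytics' appearing after 'data': {probability:.2f}")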

Probability of 'analytics' appearing after 'data': 0.33
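
The same quantity can be expressed as a bigram estimate, P(analytics | data) = count(data, analytics) / count(data followed by any word). The sketch below is one way to compute it, assuming a regex tokenizer that strips punctuation; `next_word_probability` is a hypothetical helper and the sample sentence is made up for illustration. Since tokenization differs from the cell above, the result on the full paragraph may not match 0.33 exactly.

```python
import re
from collections import Counter

def next_word_probability(text, first, second):
    """Estimate P(second | first) from adjacent token pairs in `text`."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    bigrams = Counter(zip(tokens, tokens[1:]))
    # Occurrences of `first` that are followed by some token
    first_count = sum(n for (a, _), n in bigrams.items() if a == first)
    if first_count == 0:
        return 0.0
    return bigrams[(first, second)] / first_count

# Illustrative text: 'data' is followed by 'analytics' twice and 'analysis' once
sample = "Data analytics can help. Quantitative data analysis and data analytics differ."
print(f"{next_word_probability(sample, 'data', 'analytics'):.2f}")  # 2 of 3 -> 0.67
```

Note that this tokenizer ignores sentence boundaries, so a pair spanning the end of one sentence and the start of the next is still counted as a bigram.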
