
**NLP homework 1:**

This assignment will provide hands-on practice with text processing techniques in Python,
including tokenization, lemmatization, and stemming. You will also gain experience loading,
analyzing, and scraping textual data from different sources.


*Yarden Mizrahi - 209521293*

### **Import Python Libraries**

In [None]:
# import necessary python libraries
import nltk
import spacy
from bs4 import BeautifulSoup

### **Loading The Data**

In [None]:
from google.colab import drive
drive.mount('/content/drive')
path = "/content/drive/MyDrive/Data/spam.csv"

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## **Data Loading & Basic Analysis**

In [None]:
import pandas as pd
# Download nltk resources if not already present
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# Load the dataset
df = pd.read_csv(path, encoding='latin-1')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Basic statistics on the data:

In [None]:
# Basic statistics
total_messages = len(df)
spam_messages = len(df[df['v1'] == 'spam'])
ham_messages = len(df[df['v1'] == 'ham'])

print("Total number of SMS messages:", total_messages)
print("Number of spam messages:", spam_messages)
print("Number of ham messages:", ham_messages)

Total number of SMS messages: 5572
Number of spam messages: 747
Number of ham messages: 4825


In [None]:
# Calculate the average number of tokens per message
# (word_tokenize also returns punctuation marks as separate tokens)
from nltk.tokenize import word_tokenize

df['word_count'] = df['v2'].apply(lambda x: len(word_tokenize(x)))
average_words_per_message = df['word_count'].mean()
print("Average number of words per message:", average_words_per_message)

Average number of words per message: 18.699389806173727


In [None]:
# Tokenize, remove stopwords, and lemmatize words
from nltk.probability import FreqDist
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import string

stop_words = set(stopwords.words('english'))
punctuation = set(string.punctuation)
wordnet_lemmatizer = WordNetLemmatizer()

all_words = []
for message in df['v2']:
    words = word_tokenize(message.lower())
    # isalpha() already excludes punctuation and any token containing digits
    filtered_words = [
        wordnet_lemmatizer.lemmatize(w)
        for w in words
        if w.isalpha() and w not in stop_words
    ]
    all_words.extend(filtered_words)

# Calculate frequency distribution of words
fdist = FreqDist(all_words)

# Print 5 most frequent words
print("5 most frequent words:")
for word, frequency in fdist.most_common(5):
    print(f"{word}: {frequency}")

5 most frequent words:
u: 1184
call: 603
get: 396
ur: 381
gt: 318


In [None]:
# Number of words that appear only once
unique_words = [word for word, frequency in fdist.items() if frequency == 1]
print("Number of words that only appear once:", len(unique_words))

Number of words that only appear once: 3293


### **Text Processing**

### Tokenize the SMS text using both nltk and spaCy. Analyze the time complexity of the tokenization algorithm


In [None]:
# Tokenize using NLTK
def tokenize_nltk(text):
    tokens = word_tokenize(text)
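    # Lowercase before filtering so capitalized stop words (e.g. "The") are removed too
    tokens = [word.lower() for word in tokens]
    # isalnum() already drops pure punctuation tokens
    tokens = [word for word in tokens if word.isalnum() and word not in stop_words]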
    return tokens


# Example usage
tokens_nltk = tokenize_nltk(df['v2'][0])
print("NLTK Tokens:", tokens_nltk)

NLTK Tokens: ['go', 'jurong', 'point', 'crazy', 'available', 'bugis', 'n', 'great', 'world', 'la', 'e', 'buffet', 'cine', 'got', 'amore', 'wat']


In [None]:
# Load spaCy model
nlp = spacy.load("en_core_web_sm")

# Tokenize using spaCy
def tokenize_spacy(text):
    doc = nlp(text)
    return [token.text for token in doc]

# Example usage
tokens_spacy = tokenize_spacy(df['v2'][0])
print("spaCy Tokens:", tokens_spacy)

spaCy Tokens: ['Go', 'until', 'jurong', 'point', ',', 'crazy', '..', 'Available', 'only', 'in', 'bugis', 'n', 'great', 'world', 'la', 'e', 'buffet', '...', 'Cine', 'there', 'got', 'amore', 'wat', '...']


The time complexity of the tokenization algorithms:

1. NLTK Tokenization:
   NLTK's `word_tokenize` first splits the text into sentences with the Punkt sentence tokenizer and then applies a Treebank-style set of regular-expression rules to each sentence. The rule set is fixed and each rule scans the text roughly once, so the running time is generally considered linear, O(n), where n is the length of the input text.

2. spaCy Tokenization:
   spaCy's tokenizer is rule-based rather than statistical: it splits the text on whitespace and then applies language-specific prefix, suffix, and infix rules together with a table of special-case exceptions. Its running time is typically linear or close to linear, O(n), where n is the length of the input text. Note that `nlp(text)`, as used above, also runs the rest of the pipeline (tagger, parser, and so on), so its wall-clock time is higher than that of tokenization alone.
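
One way to check the linear-time claim empirically is to time a tokenizer on inputs of growing length and see whether the time per character stays roughly flat. The sketch below is a minimal, standard-library illustration. `simple_tokenize` is a stand-in regex tokenizer (an assumption for illustration, not NLTK's or spaCy's implementation), and `time_per_char` is a hypothetical helper; any callable, such as `word_tokenize`, could be passed in its place.

```python
import re
import time

# Stand-in tokenizer for illustration only: runs of word characters,
# or single punctuation marks. This is NOT how NLTK or spaCy tokenize.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

def simple_tokenize(text):
    return TOKEN_RE.findall(text)

def time_per_char(tokenize, text, repeats=5):
    """Return the best observed time per input character, in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        tokenize(text)
        best = min(best, time.perf_counter() - start)
    return best / max(len(text), 1)

sample = "Go until jurong point, crazy.. Available only in bugis n great world. "
for factor in (100, 1_000, 10_000):
    text = sample * factor
    print(f"{len(text):>8} chars: {time_per_char(simple_tokenize, text):.2e} s/char")
```

If the reported time per character stays in the same range as the input grows, the measurement is consistent with O(n) behaviour; a steadily increasing value would suggest super-linear cost.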

### Lemmatize the SMS text using nltk and spaCy. Analyze the time complexity of the lemmatization algorithm


In [None]:
# Initialize WordNet Lemmatizer
lemmatizer = WordNetLemmatizer()

# Lemmatize using NLTK
def lemmatize_nltk(text):
    tokens = word_tokenize(text)
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in tokens]
    return lemmatized_tokens

# Example usage
lemmatized_nltk = lemmatize_nltk(df['v2'][0])
print("NLTK Lemmatized Tokens:", lemmatized_nltk)

NLTK Lemmatized Tokens: ['Go', 'until', 'jurong', 'point', ',', 'crazy', '..', 'Available', 'only', 'in', 'bugis', 'n', 'great', 'world', 'la', 'e', 'buffet', '...', 'Cine', 'there', 'got', 'amore', 'wat', '...']


In [None]:
# Lemmatize using spaCy
def lemmatize_spacy(text):
    doc = nlp(text)
    lemmatized_tokens = [token.lemma_ for token in doc]
    return lemmatized_tokens

# Example usage
lemmatized_spacy = lemmatize_spacy(df['v2'][0])
print("spaCy Lemmatized Tokens:", lemmatized_spacy)

spaCy Lemmatized Tokens: ['go', 'until', 'jurong', 'point', ',', 'crazy', '..', 'available', 'only', 'in', 'bugis', 'n', 'great', 'world', 'la', 'e', 'buffet', '...', 'Cine', 'there', 'get', 'amore', 'wat', '...']


The time complexity of the lemmatization algorithms:

1. **NLTK Lemmatization:**
   NLTK's `WordNetLemmatizer` is based on WordNet, a lexical database of English. For each token it strips inflectional endings and looks the candidates up in WordNet, returning the token unchanged when nothing matches. Unless a `pos` argument is passed, every token is treated as a noun, which is why 'got' stays 'got' in the output above. Tokens are processed independently, so the overall time can be considered linear, O(n), where n is the number of tokens in the input text.

2. **spaCy Lemmatization:**
   spaCy's lemmatizer is a component of its processing pipeline. For the English models it relies on rules and lookup tables that depend on the part-of-speech tag predicted for each token, so it can map 'got' to 'get' as in the output above. The cost therefore includes running the tagger that supplies those tags. The exact complexity depends on the model and the input, but in practice processing time grows linearly or close to linearly with the length of the text.
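
The two outputs above differ on 'got': NLTK leaves it unchanged while spaCy returns 'get'. `WordNetLemmatizer.lemmatize` treats a word as a noun unless a `pos` argument is given, whereas spaCy uses the tag predicted for each token. The sketch below imitates that behaviour with a tiny hand-written table. `TOY_LEMMAS` and `toy_lemmatize` are hypothetical names, and the table is not WordNet data.

```python
# Toy lemma table keyed by (word, part of speech); not real WordNet data.
TOY_LEMMAS = {
    ("got", "v"): "get",
    ("points", "n"): "point",
    ("points", "v"): "point",
}

def toy_lemmatize(word, pos="n"):
    """Look up one token; fall back to the word itself when no entry exists."""
    return TOY_LEMMAS.get((word.lower(), pos), word)

tokens = ["Cine", "there", "got", "amore", "wat"]

# Noun-only lookup, similar in spirit to calling a lemmatizer without POS tags.
print([toy_lemmatize(t) for t in tokens])

# Supplying a verb tag for "got" lets the lookup find its lemma.
tags = {"got": "v"}
print([toy_lemmatize(t, tags.get(t, "n")) for t in tokens])
```

Each token costs one dictionary lookup, so the total work grows linearly with the number of tokens, which matches the O(n) estimate given above.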

### Stem the SMS text using nltk and spaCy. Analyze the time complexity of the stemming algorithm


In [None]:
from nltk.stem import PorterStemmer

# Initialize Porter Stemmer
stemmer = PorterStemmer()

# Stem using NLTK
def stem_nltk(text):
    tokens = word_tokenize(text)
    stemmed_tokens = [stemmer.stem(token) for token in tokens]
    return stemmed_tokens

# Example usage
stemmed_nltk = stem_nltk(df['v2'][0])
print("NLTK Stemmed Tokens:", stemmed_nltk)

NLTK Stemmed Tokens: ['go', 'until', 'jurong', 'point', ',', 'crazi', '..', 'avail', 'onli', 'in', 'bugi', 'n', 'great', 'world', 'la', 'e', 'buffet', '...', 'cine', 'there', 'got', 'amor', 'wat', '...']


In [None]:
# spaCy does not ship a stemming component, so the pipeline has no "stemmer" pipe.
# Check explicitly and report it instead of attempting to stem.
if not nlp.has_pipe("stemmer"):
    print("Error: The spaCy model does not include a stemmer.")


Error: The spaCy model does not include a stemmer.


The time complexity of the stemming algorithms:

1. **NLTK Stemming:**
   NLTK's `PorterStemmer` implements Porter's algorithm, which applies a fixed sequence of heuristic suffix-stripping rules to reduce each word to a stem (for example, 'crazy' becomes 'crazi' in the output above). Each token is processed independently, and the work per token is bounded by the token's length and the fixed rule set, so the total time can be considered linear, O(n), where n is the number of tokens in the input text.

2. **spaCy Stemming:** spaCy does not include a stemmer; it provides lemmatization instead, so there is no stemming algorithm to analyze.
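
To make the idea of rule-based suffix stripping concrete, here is a deliberately simplified sketch. It is not the Porter algorithm, which applies several ordered rule phases with conditions on the remaining stem. `SUFFIX_RULES` and `toy_stem` are hypothetical names chosen for illustration.

```python
# Ordered (suffix, replacement) rules; the first matching rule wins.
# A simplified illustration, NOT the Porter rule set.
SUFFIX_RULES = [
    ("sses", "ss"),
    ("ies", "i"),
    ("ing", ""),
    ("ed", ""),
    ("s", ""),
]

def toy_stem(word, min_stem_len=2):
    """Strip at most one suffix, keeping at least min_stem_len characters."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: len(word) - len(suffix)] + replacement
    return word

print([toy_stem(w) for w in ["points", "placed", "ponies", "going", "is"]])
```

Because every word is checked against a fixed, short list of rules, the cost per word is bounded, and stemming a whole message stays linear in the number of tokens.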

### For each technique, write 2-3 sentences comparing the nltk and spaCy implementation. Consider things like output format, processing speed, language support etc.

### Tokenization

- **NLTK:**
  - **Output Format:** The `word_tokenize` function in NLTK returns a list of strings, where each string is a token.
  - **Processing Speed:** NLTK's tokenizer is implemented in pure Python with regular expressions; it is fast enough for small and medium-sized texts but tends to be slower than a compiled tokenizer on large corpora.
  - **Language Support:** `word_tokenize` defaults to English; its `language` argument selects the Punkt sentence model for a number of other languages.

- **spaCy:**
  - **Output Format:** spaCy returns a `Doc` object; iterating over it yields `Token` objects, each carrying the token text plus additional linguistic attributes such as the lemma and part-of-speech tag.
  - **Processing Speed:** spaCy is optimized for performance and can handle large texts efficiently due to its use of compiled code and optimized algorithms.
  - **Language Support:** spaCy provides robust support for multiple languages with pre-trained models for various languages.

### Lemmatization

- **NLTK:**
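  - **Output Format:** `WordNetLemmatizer.lemmatize` takes one word and returns its lemma as a string, so a list of lemmas is built by calling it for each token. Without a `pos` argument it treats every word as a noun.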
  - **Processing Speed:** NLTK's lemmatization can be slower than spaCy because it relies on querying the WordNet database, which can introduce overhead.
  - **Language Support:** Mainly supports English, leveraging the WordNet database.

- **spaCy:**
  - **Output Format:** The lemma is an attribute of each token, accessed via `token.lemma_` (a string); a list comprehension over the `Doc` collects the lemmas.
  - **Processing Speed:** spaCy's lemmatization is generally faster due to its integrated pipeline and optimized processing.
  - **Language Support:** spaCy provides extensive language support with lemmatization models for multiple languages.

### Stemming

- **NLTK:**
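  - **Output Format:** `PorterStemmer.stem` takes one word and returns its stem as a lowercase string; a list of stems is built by calling it for each token.
  - **Processing Speed:** Stemming in NLTK is relatively fast due to its rule-based approach, but the stems are not always valid words (for example, 'crazi' and 'onli' in the output above).
  - **Language Support:** NLTK ships several stemmers: Porter and Lancaster target English, while the Snowball stemmer covers a number of additional languages.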

- **spaCy:**
  - **Output Format:** spaCy does not provide a direct stemming functionality. Instead, it focuses on lemmatization, which can serve a similar purpose.
  - **Processing Speed:** N/A for stemming.
  - **Language Support:** N/A for stemming.


### Print updated statistics on word count and frequent words after applying each technique.

In [None]:
df = df[['v1', 'v2']]  # Keep only the relevant columns

# Initialize stopwords and punctuation
stop_words = set(stopwords.words('english'))
punctuation = set(string.punctuation)

### **Preprocessing Functions**


**NLTK Tokenization, Lemmatization, and Stemming**

In [None]:
# Initialize NLTK tools
lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()

# Tokenization, Lemmatization, and Stemming with NLTK
def preprocess_nltk(text):
    tokens = word_tokenize(text)
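    # Note: the stop-word test runs on the original casing, so capitalized
    # stop words such as "I" pass the filter and are lowercased afterwards.
    tokens = [word.lower() for word in tokens if word.isalnum() and word not in stop_words and word not in punctuation]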
    lemmas = [lemmatizer.lemmatize(token) for token in tokens]
    stems = [stemmer.stem(token) for token in tokens]
    return tokens, lemmas, stems


**spaCy Tokenization and Lemmatization**

In [None]:
# Tokenization and Lemmatization with spaCy
def preprocess_spacy(text):
    doc = nlp(text)
    tokens = [token.text.lower() for token in doc if token.is_alpha and not token.is_stop]
    lemmas = [token.lemma_.lower() for token in doc if token.is_alpha and not token.is_stop]
    return tokens, lemmas

### **Update Statistics Functions**

In [None]:
from collections import Counter
def update_statistics(tokens_list):
    word_counts = Counter(tokens_list)
    total_words = sum(word_counts.values())
    unique_words = len(word_counts)
    most_common_words = word_counts.most_common(5)
    words_appearing_once = sum(1 for count in word_counts.values() if count == 1)
    return total_words, unique_words, most_common_words, words_appearing_once

def print_statistics(method_name, tokens, lemmas, stems=None):
    print(f"Statistics for {method_name}:")

    total_words, unique_words, most_common_words, words_appearing_once = update_statistics(tokens)
    print(f"  Total words: {total_words}")
    print(f"  Unique words: {unique_words}")
    print(f"  Most common words: {most_common_words}")
    print(f"  Words appearing once: {words_appearing_once}")

    total_lemmas, unique_lemmas, most_common_lemmas, lemmas_appearing_once = update_statistics(lemmas)
    print(f"  Total lemmas: {total_lemmas}")
    print(f"  Unique lemmas: {unique_lemmas}")
    print(f"  Most common lemmas: {most_common_lemmas}")
    print(f"  Lemmas appearing once: {lemmas_appearing_once}")

    if stems:
        total_stems, unique_stems, most_common_stems, stems_appearing_once = update_statistics(stems)
        print(f"  Total stems: {total_stems}")
        print(f"  Unique stems: {unique_stems}")
        print(f"  Most common stems: {most_common_stems}")
        print(f"  Stems appearing once: {stems_appearing_once}")


### **Applying Preprocessing and Printing Statistics**

In [None]:
nltk_tokens, nltk_lemmas, nltk_stems = [], [], []
spacy_tokens, spacy_lemmas = [], []

for message in df['v2']:
    tokens, lemmas, stems = preprocess_nltk(message)
    nltk_tokens.extend(tokens)
    nltk_lemmas.extend(lemmas)
    nltk_stems.extend(stems)

    tokens, lemmas = preprocess_spacy(message)
    spacy_tokens.extend(tokens)
    spacy_lemmas.extend(lemmas)

print_statistics("NLTK", nltk_tokens, nltk_lemmas, nltk_stems)
print_statistics("spaCy", spacy_tokens, spacy_lemmas)


Statistics for NLTK:
  Total words: 56617
  Unique words: 8188
  Most common words: [('i', 1956), ('u', 1133), ('call', 576), ('2', 485), ('get', 385)]
  Words appearing once: 4107
  Total lemmas: 56617
  Unique lemmas: 7661
  Most common lemmas: [('i', 1956), ('u', 1197), ('call', 603), ('2', 485), ('get', 396)]
  Lemmas appearing once: 3803
  Total stems: 56617
  Unique stems: 6861
  Most common stems: [('i', 1956), ('u', 1133), ('call', 656), ('2', 485), ('go', 451)]
  Stems appearing once: 3298
Statistics for spaCy:
  Total words: 41172
  Unique words: 7036
  Most common words: [('u', 1098), ('ur', 380), ('nt', 319), ('free', 283), ('ok', 282)]
  Words appearing once: 3608
  Total lemmas: 41172
  Unique lemmas: 6139
  Most common lemmas: [('u', 1098), ('ur', 380), ('go', 329), ('come', 325), ('not', 313)]
  Lemmas appearing once: 3128
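
In the NLTK figures above, 'i' tops every list even though it is an English stop word. The likely reason is the order of operations in `preprocess_nltk`: membership in `stop_words` is tested on the original token and lowercasing happens afterwards, so capitalized forms such as 'I' pass the filter. The sketch below reproduces the effect with a small hypothetical stop-word set (`TOY_STOP_WORDS`), not NLTK's list; both function names are also made up for illustration.

```python
TOY_STOP_WORDS = {"i", "the", "for", "if"}  # hypothetical, for illustration only

def filter_then_lower(tokens):
    """Stop-word test on the original casing, lowercase afterwards."""
    return [t.lower() for t in tokens if t.isalnum() and t not in TOY_STOP_WORDS]

def lower_then_filter(tokens):
    """Lowercase first so capitalized stop words are caught as well."""
    lowered = (t.lower() for t in tokens)
    return [t for t in lowered if t.isalnum() and t not in TOY_STOP_WORDS]

tokens = ["I", "will", "call", "if", "The", "bus", "is", "late", "."]
print(filter_then_lower(tokens))
print(lower_then_filter(tokens))
```

With the first ordering, 'I' and 'The' survive as 'i' and 'the'; with the second they are removed. Applying the second ordering to the SMS data would change the NLTK totals and bring its most-frequent list closer to spaCy's.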


## **Web Scraping**

### Use BeautifulSoup to scrape text data from a public page on one of your social media profiles.

In [None]:
import requests

# URL of the page to scrape: the README of my GitHub repository "Boris"
url = 'https://github.com/yardenmizrahi/Boris'

# Send a GET request to the URL
response = requests.get(url)

if response.status_code == 200:
    # Parse the HTML content using BeautifulSoup
    soup = BeautifulSoup(response.content, 'html.parser')

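    # Find the main content div (the class name appears to be auto-generated and may change)
    content_div = soup.find("div", {"class": "Box-sc-g0xbh4-0 bJMeLZ js-snippet-clipboard-copy-unpositioned"})

    if content_div is None:
        # soup.find returns None when nothing matches; avoid an AttributeError
        print("Could not find the content div; the page layout may have changed.")
    else:
        # Extract the text from the content div
        text_content = content_div.text
        print(text_content)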
else:
    print(f"Failed to retrieve the page. Status code: {response.status_code}")

Parallel Implementation of Proximity Criteria
Final project
Course 10324, Parallel and Distributed Computation
2023 Fall Semester
A set of N points is placed in two-dimensional plane. Coordinates (x, y) of each point P are defined as follows:
x = ((x2 – x1) / 2 ) * sin (tπ /2) + (x2 + x1) / 2)
y = ax + b
where (x1, x2, a, b) are constant parameters predefined for each point P.
Problem Definition
We will say that some point P from the set satisfies a Proximity Criteria if there exist at least K points in the set with a distance from the point P less than a given value D.
Given a value of parameter t, we want to find if there exist at least 3 points that satisfies the Proximity Criteria
Requirements
•	Perform checks for Proximity Criteria for tCount + 1 values of  t:
t = 2 * i / tCount  - 1,          i = 0,  1,  2,  3, …,  tCount
where tCount is a given integer number.
•	For each value of t find if there is three points that satisfy the Proximity Criteria. If such three points are found 

### Perform tokenization, lemmatization, and stemming on the scraped text

In [None]:
stop_words = set(stopwords.words('english'))
punctuation = set(string.punctuation)

copy_text_content = text_content
# Tokenization
tokens = word_tokenize(copy_text_content)
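# Note: stop words are matched before lowercasing, so capitalized ones
# such as "We" and "For" remain in the output as 'we' and 'for'.
tokens = [word.lower() for word in tokens if word.isalnum() and word not in stop_words and word not in punctuation]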
print("Tokens:", tokens)

# Lemmatization
lemmas = [lemmatizer.lemmatize(token) for token in tokens]
print("Lemmas:", lemmas)

# Stemming
stems = [stemmer.stem(token) for token in tokens]
print("Stems:", stems)

Tokens: ['parallel', 'implementation', 'proximity', 'criteria', 'final', 'project', 'course', '10324', 'parallel', 'distributed', 'computation', '2023', 'fall', 'semester', 'a', 'set', 'n', 'points', 'placed', 'plane', 'coordinates', 'x', 'point', 'p', 'defined', 'follows', 'x', 'x2', 'x1', '2', 'sin', 'tπ', 'x2', 'x1', '2', 'ax', 'b', 'x1', 'x2', 'b', 'constant', 'parameters', 'predefined', 'point', 'problem', 'definition', 'we', 'say', 'point', 'p', 'set', 'satisfies', 'proximity', 'criteria', 'exist', 'least', 'k', 'points', 'set', 'distance', 'point', 'p', 'less', 'given', 'value', 'given', 'value', 'parameter', 'want', 'find', 'exist', 'least', '3', 'points', 'satisfies', 'proximity', 'criteria', 'requirements', 'perform', 'checks', 'proximity', 'criteria', 'tcount', '1', 'values', '2', 'tcount', '1', '0', '1', '2', '3', 'tcount', 'tcount', 'given', 'integer', 'number', 'for', 'value', 'find', 'three', 'points', 'satisfy', 'proximity', 'criteria', 'if', 'three', 'points', 'found',

### Print word statistics on the scraped data before and after text processing.

In [None]:
# Function to print statistics
def print_statistics(title, word_list):
    counter = Counter(word_list)
    total_words = len(word_list)
    unique_words = len(counter)
    most_common = counter.most_common(5)
    single_occurrences = len([word for word, count in counter.items() if count == 1])

    print(f"\n{title}")
    print(f"Total Words: {total_words}")
    print(f"Unique Words: {unique_words}")
    print(f"Most Frequent Words: {most_common}")
    print(f"Words that appear only once: {single_occurrences}")

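# Note: text_content is a raw string, so Counter iterates over its characters;
# the "before" figures are therefore character counts, not word counts.
# Passing text_content.split() instead would count whitespace-separated words.
print_statistics("Before processing statistics", text_content)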
print_statistics("After processing statistics", stems)


Before processing statistics
Total Words: 2661
Unique Words: 72
Most Frequent Words: [(' ', 461), ('t', 246), ('e', 217), ('i', 185), ('o', 157)]
Words that appear only once: 15

After processing statistics
Total Words: 275
Unique Words: 139
Most Frequent Words: [('point', 20), ('proxim', 11), ('criteria', 11), ('satisfi', 9), ('the', 6)]
Words that appear only once: 94
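
The "before" figures above report single characters such as ' ' and 't' as the most frequent "words". This is what `Counter` does when it receives a string: it iterates character by character. A minimal sketch of the difference, using a short made-up sentence (`sample` is hypothetical, not the scraped text):

```python
from collections import Counter

sample = "the point is near the other point"

# A string is iterated character by character.
char_counts = Counter(sample)

# Splitting on whitespace first yields word-level counts.
word_counts = Counter(sample.split())

print(len(sample), char_counts.most_common(1))
print(len(sample.split()), word_counts.most_common(2))
```

Passing `text_content.split()` (or the unfiltered `word_tokenize` output) to `print_statistics` would give a word-level baseline that is directly comparable with the "after" figures.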


## **WhatsApp Analysis**

### Import a .txt file of at least 50 WhatsApp messages in Hebrew.

In [None]:
import re

# Define the file path to your exported WhatsApp chat
file_path = '/content/drive/MyDrive/_chat.txt'

# Read the content of the file
with open(file_path, 'r', encoding='utf-8') as file:
    chat_data = file.read()

# Define a regex pattern to match individual messages
pattern = re.compile(r'\[(\d{2}\.\d{2}\.\d{4}, \d{2}:\d{2}:\d{2})\] (.*?): (.*)')

# Find all messages using the regex pattern
messages = pattern.findall(chat_data)

# Convert tuples to lists
messages = [list(msg) for msg in messages]

# Extract the last 60 messages
last_60_messages = messages[-60:]

# Flatten the list of lists to a single list
flattened_messages = [item for sublist in last_60_messages for item in sublist]

# Display the flattened list of messages
print(flattened_messages)

['21.12.2023, 11:22:46', '~\u202fNoam 🦥', '\u202b\u200f~\u202fNoam 🦥 הצטרף/ה לקבוצה באמצעות קישור ההזמנה\u202c', '21.12.2023, 11:23:51', 'דגנית ציטרין בר-און', '🤝ברוכות וברוכים הבאים לכל המצטרפים בשעות האחרונות. לטובת מי שהצטרף לאחרונה, אני משתפת משרות שנשלחו מוקדן יותר ולאחר מכן ישותפו משרות חדשות.', '21.12.2023, 11:24:48', 'דגנית ציטרין בר-און', 'https://www.linkedin.com/posts/paz-levin-3765751b1_join-my-team-in-the-heart-of-tel-aviv-activity-7140969976399642624-gx3_?utm_source=share&utm_medium=member_android', '21.12.2023, 11:24:48', 'דגנית ציטרין בר-און', 'שלום לכולם, ', '21.12.2023, 11:24:49', 'דגנית ציטרין בר-און', 'הבוגר אנדרי סיאפין מחפש סטודנטים או בוגרים מהנדסה רפואית לביצוע משימה זמנית לבדיקות ולידציה של המכשור הרפואי - מי שמעוניין או מעוניינת - פנו אל אנדרי:', '21.12.2023, 11:24:49', 'דגנית ציטרין בר-און', 'משרות מהבוגר אבי לוי :', '21.12.2023, 11:24:50', 'דגנית ציטרין בר-און', 'https://www.metacareers.com/jobs/?offices', '21.12.2023, 11:24:51', 'דגנית ציטרין בר-און', 'משרו
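
Because the flattened list mixes timestamps, sender names, and message bodies, sender names can end up dominating any word counts computed from it. A minimal sketch of keeping only the message text is shown below. The two sample lines are made up to match the `[DD.MM.YYYY, HH:MM:SS] sender: text` layout assumed by the pattern above, and `extract_bodies` is a hypothetical helper.

```python
import re

# Same layout as the pattern above: [date, time] sender: message
MESSAGE_RE = re.compile(r'\[(\d{2}\.\d{2}\.\d{4}, \d{2}:\d{2}:\d{2})\] (.*?): (.*)')

def extract_bodies(chat_text, last_n=None):
    """Return only the message bodies, optionally limited to the last_n messages."""
    bodies = [body for _timestamp, _sender, body in MESSAGE_RE.findall(chat_text)]
    if last_n is None:
        return bodies
    # max(...) keeps the slice valid when last_n is 0 or exceeds the message count
    return bodies[max(len(bodies) - last_n, 0):]

# Made-up sample lines, for illustration only.
sample_chat = (
    "[21.12.2023, 11:22:46] Alice: hello everyone\n"
    "[21.12.2023, 11:23:51] Bob: new jobs were posted: see the link\n"
)
print(extract_bodies(sample_chat))
```

Note that `.` does not match a newline, so only the first line of a multi-line message is captured by this pattern; continuation lines would need separate handling.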

### Tokenize, lemmatize, and stem the WhatsApp data.

In [None]:
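# Note: flattened_messages also contains timestamps and sender names,
# so they are tokenized together with the message text.
copy_messages = ','.join(flattened_messages)
# Tokenization (stop_words holds English stop words only, so Hebrew stop words are kept)
tokens = word_tokenize(copy_messages)
tokens = [word.lower() for word in tokens if word.isalnum() and word not in stop_words and word not in punctuation]
print("Tokens:", tokens)

# Lemmatization (WordNet covers English only, so Hebrew tokens are returned unchanged)
lemmas = [lemmatizer.lemmatize(token) for token in tokens]
print("Lemmas:", lemmas)

# Stemming (the Porter rules target English suffixes, so Hebrew tokens are unaffected)
stems = [stemmer.stem(token) for token in tokens]
print("Stems:", stems)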

Tokens: ['noam', 'noam', 'לקבוצה', 'באמצעות', 'קישור', 'דגנית', 'ציטרין', 'וברוכים', 'הבאים', 'לכל', 'המצטרפים', 'בשעות', 'האחרונות', 'לטובת', 'מי', 'שהצטרף', 'לאחרונה', 'אני', 'משתפת', 'משרות', 'שנשלחו', 'מוקדן', 'יותר', 'ולאחר', 'מכן', 'ישותפו', 'משרות', 'דגנית', 'ציטרין', 'https', 'דגנית', 'ציטרין', 'שלום', 'לכולם', 'דגנית', 'ציטרין', 'הבוגר', 'אנדרי', 'סיאפין', 'מחפש', 'סטודנטים', 'או', 'בוגרים', 'מהנדסה', 'רפואית', 'לביצוע', 'משימה', 'זמנית', 'לבדיקות', 'ולידציה', 'של', 'המכשור', 'הרפואי', 'מי', 'שמעוניין', 'או', 'מעוניינת', 'פנו', 'אל', 'אנדרי', 'דגנית', 'ציטרין', 'משרות', 'מהבוגר', 'אבי', 'לוי', 'דגנית', 'ציטרין', 'https', 'דגנית', 'ציטרין', 'משרות', 'מהבוגר', 'תומר', 'הקישור', 'הוא', 'ללינקדאין', 'של', 'דגנית', 'ציטרין', 'דגנית', 'ציטרין', 'צרו', 'קשר', 'עם', 'עומר', 'כספי', 'בוגר', 'ומוסמך', 'אפקה', 'https', 'דגנית', 'ציטרין', 'משרות', 'חדשות', 'דגנית', 'ציטרין', 'https', 'דגנית', 'ציטרין', 'https', 'דגנית', 'ציטרין', 'בחברת', 'utopia', 'tech', 'corp', 'מחפשים', 'פרודאקט', 'תו

### Print comparisons of word statistics before and after processing.

In [None]:
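# Note: the joined string is iterated character by character, so the "before"
# figures are character counts; pass a list of words for word-level counts.
print_statistics("Before processing statistics", ','.join(flattened_messages))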
print_statistics("After processing statistics", stems)


Before processing statistics
Total Words: 5953
Unique Words: 113
Most Frequent Words: [(' ', 466), ('2', 318), (',', 246), ('י', 241), ('ו', 219)]
Words that appear only once: 6

After processing statistics
Total Words: 400
Unique Words: 156
Most Frequent Words: [('ציטרין', 44), ('דגנית', 43), ('לקבוצה', 17), ('באמצעות', 17), ('קישור', 17)]
Words that appear only once: 79
