<a href="https://colab.research.google.com/github/sidmohan0/misc-python-scripts/blob/main/whatsappcleanup.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Overview
This repository contains a Python script for cleaning and extracting semantic topics from a WhatsApp chat archive text file. The script takes in the text file, preprocesses the data, tokenizes the text, and extracts relevant words and phrases based on a list of topics provided by the user. The final output is a CSV or Pandas DataFrame that can be fed into a semantic search engine for analysis and exploration.

# Requirements
Python 3.x
Pandas
NLTK
Regex
Usage
To use the script, clone this repository and navigate to the directory in your terminal or command prompt. Run the following command:


```
python main.py [input_file] [topics_file] [output_file]

```


Where [input_file] is the name of your WhatsApp chat archive text file, [topics_file] is the name of the file containing the list of topics, and [output_file] is the name of the output CSV or DataFrame.

# Contributions
Contributions are welcome! If you'd like to make changes or improvements to the code, please open a pull request.

# License
This project is licensed under the MIT License. See LICENSE for details.

# Initialization

Let's start by installing the required libraries:

In [2]:
!pip install nltk



In [19]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

For this notebook I'm going to use a text file, chats.txt, which contains several lines of conversation between three fictional people (Sarah, Tom, John) and the conversations span a mix of fully formatted (with timestamp), just the name of the person (i.e. "John: ") or empty lines or random characters. 

In [18]:
import nltk
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [17]:
from nltk.corpus import stopwords
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [None]:


def preprocess(text):
    # Step 1: Lowercase the text
    text = text.lower()
    
    # Step 2: Tokenize the text
    tokens = word_tokenize(text)
    
    # Step 3: Remove stop words
    stop_words = set(stopwords.words("english"))
    tokens = [x for x in tokens if x not in stop_words]
    
    # Step 4: Remove punctuation and special characters
    tokens = [re.sub(r'[^\w\s]', '', token) for token in tokens]
    
    # Step 5: Identify proper nouns and remove them
    pos_tags = nltk.pos_tag(tokens)
    tokens = [token for token, pos in pos_tags if pos != 'NNP']
    
    # Step 6: Lemmatize the tokens
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(token) for token in tokens]
    
    return tokens


file_name = 'chats.txt'
names = set()

with open(file_name, 'r') as file:
  text = file.read()
  lines = text.split('\n')
  
  for line in lines:
    if line.startswith('['):
      split_line = line.split(']')
      if len(split_line) >= 2:
        names.add(split_line[1].split(':')[0].strip())

    elif line.count(':') == 1:
        names.add(line.split(':')[0].strip())
    
print("Names:")
print(names)

preprocessed_text = preprocess(text)
print("Preprocessed text:")
print(preprocessed_text)



In [None]:
names_lower = [item.lower() for item in names]
print(names_lower)


In [None]:
def remove_names(lst):
    return [x for x in lst if x not in names_lower]

processed_list_names = remove_names(preprocessed_text)

print(preprocessed_text)
print(processed_list_names)

In [None]:

def remove_blank_spaces(lst):
    return [x.strip() for x in lst if x.strip()]
processed_list_no_names_no_spaces = remove_blank_spaces(processed_list_names)

print(preprocessed_text)
print(processed_list_names)
print(processed_list_no_names_no_spaces)

In [None]:
def remove_short_items(lst):
    return [x for x in lst if len(x) > 2]
processed_list_no_names_no_spaces_no_short_items = remove_blank_spaces(processed_list_no_names_no_spaces)

print(preprocessed_text)
print(processed_list_names)
print(processed_list_no_names_no_spaces)
print(processed_list_no_names_no_spaces_no_short_items)

In [None]:
def print_length_of_strings(list_of_strings):
    for string in list_of_strings:
        print(len(string))

print_length_of_strings(processed_list_no_names_no_spaces_no_short_items)

In [None]:
def remove_short_strings(list_of_strings):
    for string in list_of_strings:
        if len(string) < 2:
            list_of_strings.remove(string)
    return list_of_strings

print(remove_short_strings(processed_list_no_names_no_spaces_no_short_items))

In [60]:
final_list = processed_list_no_names_no_spaces_no_short_items

When working with natural language processing (NLP) and large amounts of text data, the dimensionality of the data can become quite high. This can make it difficult to process and analyze the data efficiently. To address this issue, we often use a technique called stemming or lemmatizing to reduce the dimensionality of the data.

Stemming and lemmatizing are processes that aim to reduce words to their base or root form. For example, words like "running," "runner," and "ran" would all be reduced to their base form "run." By reducing words to their base form, we can better identify and group together words that have a similar meaning, even if they are written differently.

The two most common techniques for stemming or lemmatizing in NLP are the PorterStemmer and the WordNetLemmatizer. The PorterStemmer is a rule-based approach to stemming that uses a set of rules to reduce words to their base form. The WordNetLemmatizer, on the other hand, uses a dictionary to look up words and reduce them to their base form.

There are trade-offs to both techniques. The PorterStemmer is a relatively simple and fast approach, but it may not always produce the most accurate results. The WordNetLemmatizer, on the other hand, is a more sophisticated approach that can produce more accurate results, but it may also be slower and more resource-intensive.

Ultimately, the choice between stemming and lemmatizing will depend on the specific requirements of your NLP project. If accuracy is a top priority, lemmatizing may be the way to go. If speed and simplicity are more important, stemming may be the better choice.

In this case, I'm going to go ahead with lemmatizing.  in this case since i'm running on my own compute, I'll spluge and go with WordNetLemmatizer. 

In [61]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# lemmatize words in a list
lemmatized_words = [lemmatizer.lemmatize(word) for word in final_list]

# Create a dictionary to store the themes and their counts
themes = {}

# Loop through each token in the preprocessed text
for token in final_list:
    # If the token is already a key in the themes dictionary, increment its value by 1
    if token in themes:
        themes[token] += 1
    # Otherwise, add the token to the themes dictionary with a value of 1
    else:
        themes[token] = 1

# Sort the themes by their count, in descending order
sorted_themes = sorted(themes.items(), key=lambda x: x[1], reverse=True)

# Print the top 5 most common themes
print("Top 5 themes:")
for i in range(5):
    print("{}. {} ({} occurrences)".format(i + 1, sorted_themes[i][0], sorted_themes[i][1]))

# Extract and categorize important dates and events
# (TODO: implement code to extract and categorize dates)



Top 5 themes:
1. great (11 occurrences)
2. 01012022 (11 occurrences)
3. ll (7 occurrences)
4. one (7 occurrences)
5. hey (6 occurrences)
