<a href="https://colab.research.google.com/github/sakarla/AI-in-the-Built-Environment/blob/main/week%204_5_Data%20Visualization/Notebook%20code/text_preprocessing_visulization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Text pre-processing and visualization:**
In this section, we will implement several preprocessing techniques on the text data collected previously. The objective of this part is to familiarize you with essential steps that enhance data clarity, ensuring it is well-prepared for the machine learning process.


* Text preprocessing
* Visualization
  * Bar chart
  * Word cloud
* Bag-of-words
* Word2Vec





Let's get started!

# **1. Text preprocessing**

To start, we will link this notebook to your Google Drive. Make sure you are logged in to your Google account.

In [None]:
from google.colab import drive
drive.mount('/content/drive')


Let's combine all the data we've collected to create one file.

You can change `folder_path` to the directory that holds your files (for example, `folder_path = '/content/drive/MyDrive/Colab Notebooks/text'`).
You can also change the path where the combined file is saved (the `with open('/content/drive/MyDrive/Colab Notebooks/text/combined_data.json', 'w') as f:` line).

In [None]:
import json
import os

#change directory if you need
##################################################################
# Specify the folder containing your JSON files
folder_path = '/content/drive/MyDrive/Colab Notebooks/text'
##################################################################

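# List all JSON files in the folder, skipping a combined file left over from an earlier run
json_files = [file for file in os.listdir(folder_path)
              if file.endswith('.json') and file != 'combined_data.json']

combined_data = []

for file in json_files:
    with open(os.path.join(folder_path, file), 'r', encoding='utf-8') as f:
        data = json.load(f)
    if isinstance(data, list):
        combined_data.extend(data)  # the file holds a list of records
    else:
        combined_data.append(data)  # the file holds a single dictionary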

#change directory if you need
##################################################################
# Save the combined data to a new JSON file
with open('/content/drive/MyDrive/Colab Notebooks/text/combined_data.json', 'w') as f:
    json.dump(combined_data, f, indent=4)
##################################################################
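To try the merging logic without touching Google Drive, the sketch below wraps it in a function and runs it on two throwaway files. The function name `combine_json_files`, the file names, and the sample records are hypothetical; the sketch assumes each file holds either a list of records or a single dictionary.

```python
import json
import os
import tempfile


def combine_json_files(folder_path, output_name="combined_data.json"):
    """Merge every JSON file in folder_path into one list of records."""
    combined = []
    for name in sorted(os.listdir(folder_path)):
        # Skip non-JSON files and a previously written output file
        if not name.endswith(".json") or name == output_name:
            continue
        with open(os.path.join(folder_path, name), "r", encoding="utf-8") as f:
            data = json.load(f)
        if isinstance(data, list):
            combined.extend(data)   # file holds a list of records
        else:
            combined.append(data)   # file holds a single record (dict)
    return combined


# Demonstration on throwaway files (hypothetical data) instead of Google Drive
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "a.json"), "w", encoding="utf-8") as f:
        json.dump([{"content": ["first text"]}], f)
    with open(os.path.join(tmp, "b.json"), "w", encoding="utf-8") as f:
        json.dump({"content": ["second text"]}, f)
    demo_combined = combine_json_files(tmp)

print(len(demo_combined))  # one record from each file
```

Skipping the output file matters when it is saved in the same folder as the inputs: otherwise a second run would read the combined file back in and duplicate every record.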


**Step 1.1: Import Libraries**

Make sure to install any libraries that are not already installed by using `!pip install library_name`.

In [None]:
import pandas as pd
import numpy as np
import re
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer


**Step 1.2: Load Text File**

Assuming you have a JSON file called **combined_data.json**, you can read it as follows:

In [None]:
import json

###########################################################################
# Load JSON Data
json_file = '/content/drive/MyDrive/Colab Notebooks/text/combined_data.json'  # Replace with the path to your JSON file
###########################################################################

with open(json_file, 'r', encoding='utf-8') as file:
    data = json.load(file)

**Step 1.3: Extract and Preprocess Text Data**

We will extract the text content from the JSON file and preprocess it.

In [None]:
# Initialize an empty list to store text content
text_data = []
#  Extract "content" from JSON
for item in data:
    if 'content' in item:
        text_content = item['content']

        # Join the list of content elements into a single string
        text_content = ' '.join(text_content)
        text_data.append(text_content)


Now, **text_data** contains one string per item in the JSON file: the joined content of that item.

Before preprocessing, it's essential to understand your data. You can do this by checking the first few entries and getting some basic statistics.

In [None]:
# Display the first text
print(text_data[:1])  # Replace 1 with the number of texts you want to display

# Get basic statistics of the text data
print("Number of texts:", len(text_data))
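A few more statistics can help you judge how long the texts are. The following is a minimal sketch on a made-up list; `sample_texts` is a hypothetical stand-in for `text_data`, and words are counted with a simple whitespace split rather than a proper tokenizer.

```python
# Toy stand-in for text_data; replace with your own list of texts
sample_texts = [
    "Green roofs reduce heat in dense cities.",
    "Daylight improves comfort.",
]

# Word counts per text, using a simple whitespace split
word_counts = [len(text.split()) for text in sample_texts]

print("Number of texts:", len(sample_texts))
print("Words per text:", word_counts)
print("Average words per text:", sum(word_counts) / len(word_counts))
print("Shortest / longest:", min(word_counts), "/", max(word_counts))
```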


**Step 1.4: Text Cleaning**

Text data often contains noise that needs to be cleaned. Here are some common text cleaning steps:


**Lowercasing**: Convert all text to lowercase to ensure consistency.

In [None]:
len(text_data)

In [None]:
text_data = [sentence.lower() for sentence in text_data]
print(text_data)

**Removing Special Characters and Numbers:** Remove punctuation, special characters, and numbers using regular expressions.

`sentence.replace('\n', ' ')`: Replaces every newline character (`'\n'`) in the text with a space, so that words on adjacent lines do not run together.

`r'[^a-zA-Z\s]'`: This regular expression pattern matches any character that is neither a letter (`a-zA-Z`) nor a whitespace character (`\s`). `re.sub` replaces each match with an empty string.

In [None]:
text_data = [re.sub(r'[^a-zA-Z\s]', '', sentence.replace('\n', ' ')) for sentence in text_data]

print(text_data)
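To see what the cleaning step does to a single string, here is a small sketch on a made-up sentence. The sample text and the extra whitespace-collapsing step are assumptions added for illustration; they are not part of the notebook's pipeline.

```python
import re

sample = "Built in 2021,\nthe tower is 120m tall!"

# Lowercase, turn newlines into spaces, then drop everything except letters and whitespace
cleaned = re.sub(r'[^a-zA-Z\s]', '', sample.lower().replace('\n', ' '))

# Collapse the runs of spaces left behind by removed numbers
cleaned = re.sub(r'\s+', ' ', cleaned).strip()

print(cleaned)  # built in the tower is m tall
```

Note the stray `m` left over from `120m`: removing digits can leave fragments like this, so it is worth inspecting the cleaned output before moving on.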

**Tokenization:** Split text into individual words or tokens.

In [None]:
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')  # recent NLTK releases load the tokenizer data from this resource


In [None]:
text_data = [nltk.word_tokenize(sentence) for sentence in text_data]
print(text_data)

**Stopword Removal:** Remove common stopwords (e.g., "the", "and", "is") to reduce noise.

In [None]:
import nltk
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
tokens_no_stopwords = [[word for word in sentence if word not in stop_words] for sentence in text_data]

In [None]:
print(tokens_no_stopwords)

# **2. Visualization**
We want to visualize word frequencies step by step. First, we create a bar chart of the most frequent words.

# **2.1 Bar chart**

**Step 2.1: Import Libraries**

Import the necessary libraries:

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
from collections import Counter




**Step 2.2**: Count Word Frequencies

Count the frequency of each word in the preprocessed text data:



In [None]:
# Combine all preprocessed sentences (stopwords removed) into a single list of words
all_words = [word for sentence in tokens_no_stopwords for word in sentence]

# Count word frequencies
word_freq = Counter(all_words)


In [None]:
word_freq

**Step 2.3**: Visualize Word Frequencies

Visualize the word frequencies using a bar chart:

In [None]:
# Get the top N most common words
N = 20  # Change this value to display more or fewer words
most_common_words = word_freq.most_common(N)

# Extract the words and their frequencies
words, frequencies = zip(*most_common_words)

# Create a bar chart
plt.figure(figsize=(12, 6))
plt.bar(words, frequencies)
plt.title(f"Top {N} Most Common Words")
plt.xlabel("Words")
plt.ylabel("Frequency")
plt.xticks(rotation=45, ha="right")  # Rotate x-axis labels for better readability
plt.show()


This code will count the frequencies of words in the preprocessed text data and create a bar chart to visualize the top N most common words. You can adjust the value of N to display more or fewer words in the chart.

# **2.2 Word cloud**

**Step 1:** Import Libraries

Import the necessary libraries:

In [None]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt


**Step 2:** Combine Preprocessed Text

Combine all the preprocessed sentences into a single text:

In [None]:
combined_text = " ".join([" ".join(sentence) for sentence in text_data])
print(combined_text)

**Step 3:** Generate the Word Cloud

Generate the word cloud from the combined text:

In [None]:
wordcloud = WordCloud(width=800, height=400, background_color='white').generate(combined_text)


**Step 4:** Display the Word Cloud

Display the word cloud using matplotlib:

In [None]:
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.title("Word Cloud")
plt.show()


# **3. Bag-of-words**

Bag of Words is a simple text representation method where each document is represented by a fixed-length vector with one element per vocabulary word. Each element holds the number of times that word occurs in the document; word order is ignored.
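The idea can be worked through by hand on a tiny example. The sketch below uses only the standard library and two made-up tokenized documents; it illustrates the representation and is not how scikit-learn implements it internally.

```python
from collections import Counter

# Toy documents, already tokenized (hypothetical data)
docs = [
    ["green", "roof", "green", "wall"],
    ["roof", "garden"],
]

# Fixed vocabulary: every distinct word, sorted alphabetically
vocabulary = sorted({word for doc in docs for word in doc})

# One count vector per document, with one position per vocabulary word
vectors = []
for doc in docs:
    counts = Counter(doc)
    vectors.append([counts[word] for word in vocabulary])

print(vocabulary)  # ['garden', 'green', 'roof', 'wall']
print(vectors)     # [[0, 2, 1, 1], [1, 0, 1, 0]]
```

Both vectors have the same length (the vocabulary size), and most entries are zero once the vocabulary grows, which is why libraries store the result as a sparse matrix.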

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# Flatten the token list into strings for vectorization
flattened_text = [' '.join(token) for token in tokens_no_stopwords]

# create the vocabulary
vectorizer = CountVectorizer()

# fit the vocabulary to the text data
vectorizer.fit(flattened_text)

# create the bag-of-words model
bow_model = vectorizer.transform(flattened_text)

# print the bag-of-words model
print(bow_model)

The `bow_model` variable is a sparse matrix with one row per document in `flattened_text` and one column per vocabulary word; each entry is the number of times that word occurs in that document. You can access the mapping from words to column indices through the `vocabulary_` attribute of the `CountVectorizer` object.

In [None]:
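# print the vocabulary (a dictionary mapping each word to its column index)
print(vectorizer.vocabulary_)

# print the column index of one word; .get returns None if the word is not in your vocabulary
print(vectorizer.vocabulary_.get('images'))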

## **Alternatives to Bag-of-Words in Python**

When representing text data in natural language processing (NLP) tasks, several alternatives to the bag-of-words model can be more effective:

### 1. N-grams
- **Definition**: Contiguous sequences of n words in a text document.
- **Usefulness**: Captures the relationship between adjacent words, helpful for understanding word order and meaning.
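As a minimal sketch of the definition above, n-grams can be built from a token list with plain Python slicing. The helper name `ngrams` and the sample tokens are hypothetical.

```python
def ngrams(tokens, n):
    """Return the contiguous n-word sequences in tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = ["natural", "light", "improves", "indoor", "comfort"]
bigrams = ngrams(tokens, 2)
print(bigrams)
```

With scikit-learn, the same idea is available through the `ngram_range` parameter of `CountVectorizer`, for example `CountVectorizer(ngram_range=(1, 2))` to count both single words and bigrams.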

### 2. Word Embeddings
- **Definition**: Dense, low-dimensional representations of words.
- **Usefulness**: Captures the semantic relationships between words, representing their meaning and relationships.

### 3. Part-of-Speech Tags
- **Definition**: Identifies the part of speech (e.g., noun, verb, adjective) of each word.
- **Usefulness**: Captures the syntactic structure and relationships between words.

### 4. Named Entity Recognition (NER)
- **Definition**: Identifies and classifies named entities (e.g., people, organizations, locations) in a text document.
- **Usefulness**: Extracts structured information from unstructured text, identifying important entities.

### 5. Syntactic Parsing
- **Definition**: Analyzes the structure of a sentence and determines relationships between words.
- **Usefulness**: Captures the syntactic structure and relationships between words in a text document.


# **4. Word2Vec**
**Word2Vec** is a **word embedding** technique that represents words in a continuous vector space, where semantically similar words are mapped to nearby points.

In [None]:
from gensim.models import Word2Vec

# Train Word2Vec model
word2vec_model = Word2Vec(tokens_no_stopwords, vector_size=100, window=5, min_count=1, workers=4)

# Get the vector for a word (only words seen during training have a vector)
if 'house' in word2vec_model.wv:
    word_vector = word2vec_model.wv['house']
    print("Word vector for 'house':", word_vector)

# Check similarity between two words
if 'house' in word2vec_model.wv and 'home' in word2vec_model.wv:
    similarity = word2vec_model.wv.similarity('house', 'home')
    print("Similarity between 'house' and 'home':", similarity)
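The similarity score reported by `wv.similarity` is the cosine similarity between the two word vectors. The sketch below computes it by hand; the three-dimensional vectors are made up for illustration and are not output from a trained model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up 3-dimensional vectors, for illustration only
vec_house = [0.9, 0.1, 0.3]
vec_home = [0.8, 0.2, 0.4]
vec_river = [-0.2, 0.9, -0.5]

print(cosine_similarity(vec_house, vec_home))   # close to 1: vectors point the same way
print(cosine_similarity(vec_house, vec_river))  # negative: vectors point in different directions
```

Values range from -1 to 1. On a small corpus the trained vectors are noisy, so similarity scores from this notebook should be read as a rough indication rather than a reliable measure of meaning.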