<a href="https://colab.research.google.com/github/swagatskalita092/Text-preprocessing/blob/main/Text_preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [48]:
import pandas as pd
import string
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

In [49]:
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [50]:
file_path = "/content/sample1.csv"
df = pd.read_csv(file_path)

In [51]:
print("\n📌 Dataset Overview:")
print(df.info())


📌 Dataset Overview:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 72289 entries, 0 to 72288
Data columns (total 7 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   issue_url                 72289 non-null  object
 1   issue_label               72289 non-null  object
 2   issue_created_at          72289 non-null  object
 3   issue_author_association  72289 non-null  object
 4   repository_url            72289 non-null  object
 5   issue_title               72289 non-null  object
 6   issue_body                65141 non-null  object
dtypes: object(7)
memory usage: 3.9+ MB
None


In [52]:
missing_values = df.isnull().sum()  # Missing entries per column
missing_percent = (missing_values / len(df)) * 100  # Share of rows missing, in percent

In [53]:
print("\n❌ Missing Values Before Processing:")
print(pd.DataFrame({"Missing Count": missing_values, "Percentage": missing_percent}))


❌ Missing Values Before Processing:
                          Missing Count  Percentage
issue_url                             0    0.000000
issue_label                           0    0.000000
issue_created_at                      0    0.000000
issue_author_association              0    0.000000
repository_url                        0    0.000000
issue_title                           0    0.000000
issue_body                         7148    9.888088


In [54]:
df.replace("", pd.NA, inplace=True)
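
Note that replacing `""` only catches bodies that are exactly the empty string. Bodies made up solely of spaces or line breaks would pass through as ordinary text. The sketch below compares the exact-match replacement with a regex-based one on a toy column. The toy values and the names `toy`, `exact_only`, and `blank_aware` are made up for illustration, and whether the real dataset contains such whitespace-only bodies has not been checked here.

```python
import pandas as pd

# Toy column standing in for df["issue_body"]; the values are invented.
toy = pd.DataFrame({"issue_body": ["real text", "", "   ", "\r\n", None]})

# Exact match: only the truly empty string becomes missing.
exact_only = toy["issue_body"].replace("", pd.NA)

# Regex match: any string that is empty or whitespace-only becomes missing.
blank_aware = toy["issue_body"].replace(r"^\s*$", pd.NA, regex=True)

exact_missing = int(exact_only.isna().sum())
blank_missing = int(blank_aware.isna().sum())
print(exact_missing, blank_missing)
```

If whitespace-only bodies should also receive the placeholder text, the regex variant could be applied to `df` before the fill step.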

In [55]:
df["issue_body"] = df["issue_body"].fillna("No description available")  # Fill missing text (assign back instead of chained inplace)

In [56]:
stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

In [57]:
def preprocess_text(text):
    """Lowercase, strip punctuation, tokenize, drop stop words, and lemmatize `text`."""
    text = text.lower()  # Convert to lowercase
    text = text.translate(str.maketrans("", "", string.punctuation))  # Remove punctuation
    tokens = word_tokenize(text)  # Split into word tokens
    tokens = [word for word in tokens if word not in stop_words]  # Remove stop words
    tokens = [lemmatizer.lemmatize(word) for word in tokens]  # Reduce each token to its lemma
    return " ".join(tokens)
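
Because punctuation is removed before tokenization, a URL collapses into one long token (the sample output of this notebook shows a token starting with `httpsgit`). One possible way around this, sketched below with the standard library only, is to blank out URLs before stripping punctuation. The names `URL_PATTERN`, `strip_urls`, and `strip_punctuation`, the regex itself, and the sample sentence are assumptions for illustration and are not part of the original notebook.

```python
import re
import string

# Hypothetical pattern: anything starting with http(s):// or www. up to the next whitespace.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def strip_urls(text: str) -> str:
    """Replace each URL with a space so neighbouring words stay separated."""
    return URL_PATTERN.sub(" ", text)


def strip_punctuation(text: str) -> str:
    """Delete every character listed in string.punctuation."""
    return text.translate(PUNCT_TABLE)


# Invented sample sentence with a placeholder URL.
sample = "Field is null https://github.com/example/repo see docs"

# Punctuation removal alone fuses the URL into a single token.
fused = strip_punctuation(sample.lower())

# Removing the URL first, then collapsing repeated whitespace.
cleaned = " ".join(strip_punctuation(strip_urls(sample.lower())).split())

print(fused)
print(cleaned)
```

If URLs carry no signal for the downstream task, a call like `strip_urls` could be placed at the top of `preprocess_text`, before the `translate` step. If they do carry signal, replacing them with a placeholder token may be a better choice.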

In [58]:
df["processed_issue_body"] = df["issue_body"].apply(preprocess_text)

In [59]:
missing_after = df["processed_issue_body"].isnull().sum()

In [60]:
print(f"\n✅ Missing Data After Preprocessing: {missing_after}")



✅ Missing Data After Preprocessing: 0


In [61]:
print("\n📊 Sample Preprocessed Data:")
print(df[["issue_body", "processed_issue_body"]].head(5))


📊 Sample Preprocessed Data:
                                          issue_body  \
0  In the Entities example, we there are some `__...   
1  **Describe the bug**\r\nUpdate the blog link i...   
2  Consider these two expressions:\r\n```\r\nf (g...   
3  ## Description  \r\nWhen grid has no height an...   
4  <!--\r\nThank you for reporting a crash in Ope...   

                                processed_issue_body  
0  entity example tilesrcrect field null httpsgit...  
1  describe bug update blog link entire website n...  
2  consider two expression f g x f g x first expr...  
3  description grid height data single row added ...  
4  thank reporting crash opensips order u underst...  


In [62]:
clean_file_path = "/content/cleaned_dataset.csv"
df.to_csv(clean_file_path, index=False)