<a href="https://colab.research.google.com/github/ravi-phdm23/Articles/blob/main/NLP_for_literature_review.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

To categorize the research articles in the sample data with unsupervised learning and NLP techniques, the structured approach below clusters the articles on text features extracted from the concatenated columns of each article (`Title`, `Abstract`, `Author Keywords`, `Index Keywords`) and then assigns each cluster the predefined theme it is most similar to.

### Step-by-Step Approach

1. **Data Loading and Preprocessing**: Load the CSV data and concatenate the relevant text columns of each article.
2. **Text Vectorization**: Convert the concatenated text into numerical vectors using TF-IDF vectorization.
3. **Unsupervised Clustering (KMeans)**: Apply KMeans clustering to identify groups of articles based on their content.
4. **Theme Matching with Cosine Similarity**: Assign each cluster to the closest theme based on its similarity to predefined keywords.

### Sample Code

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

# Load the data
df = pd.read_csv("scopus_198 - selected columns.csv")

# Combine all text columns for clustering
df['Combined_Text'] = df['Title'].fillna('') + ' ' + df['Abstract'].fillna('') + ' ' + df['Author Keywords'].fillna('') + ' ' + df['Index Keywords'].fillna('')

# Define themes with associated keywords
themes = {
    "Credit Scoring and Risk Assessment": "credit scoring, risk assessment, credit risk, creditworthiness",
    "Fraud Detection and Anti-Money Laundering (AML)": "fraud detection, AML, anti-money laundering, anomaly detection, suspicious activities",
    "Customer Service Automation and Personalization": "chatbots, virtual assistants, customer service, personalization",
    "Predictive Analytics and Market Forecasting": "predictive analytics, market forecasting, time series, stock prices",
    "Customer Churn Prediction and Retention": "churn prediction, customer retention, attrition, customer loyalty",
    "Portfolio and Wealth Management": "portfolio management, wealth management, investment, robo-advisors",
    "Operational Efficiency and Process Automation": "process automation, RPA, transaction verification, compliance",
    "Sentiment Analysis and Financial Text Mining": "sentiment analysis, text mining, financial news, public perception",
    "Regulatory Compliance and Risk Management": "regulatory compliance, risk management, compliance requirements",
    "Financial Crime Detection and Cybersecurity": "cybersecurity, financial crime, digital security, cyber attacks"
}

# Vectorize the text data
tfidf_vectorizer = TfidfVectorizer(stop_words='english', max_features=500)
text_vectors = tfidf_vectorizer.fit_transform(df['Combined_Text'])

# Apply KMeans clustering
num_clusters = 10  # Starting point; set this higher if more clusters are needed
# n_init is set explicitly because its default differs between scikit-learn versions
kmeans = KMeans(n_clusters=num_clusters, n_init=10, random_state=42)
df['Cluster'] = kmeans.fit_predict(text_vectors)

# Calculate theme vectors for cosine similarity comparison
theme_vectors = tfidf_vectorizer.transform(themes.values())

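# Assign the closest theme to each cluster
theme_names = list(themes.keys())

def assign_theme(cluster_num):
    # Select the TF-IDF rows of the articles that belong to this cluster
    cluster_mask = (df['Cluster'] == cluster_num).to_numpy()
    # Densify before averaging: a sparse .mean() returns np.matrix,
    # which recent scikit-learn versions reject in cosine_similarity
    cluster_centroid = text_vectors[cluster_mask].toarray().mean(axis=0, keepdims=True)
    similarity_scores = cosine_similarity(cluster_centroid, theme_vectors)
    return theme_names[similarity_scores.argmax()]

# Compute the theme once per cluster, then map it onto every article
cluster_to_theme = {c: assign_theme(c) for c in sorted(df['Cluster'].unique())}
df['Category'] = df['Cluster'].map(cluster_to_theme)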

# Display the results
df_result = df[['Title', 'Abstract', 'Category']]
print(df_result.head())

# Save the output to a new CSV
df_result.to_csv("categorized_research_articles.csv", index=False)
print("Categorized research articles saved as 'categorized_research_articles.csv'")
```
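
The value `num_clusters = 10` in the code above is a starting point rather than a tuned choice. One common way to compare candidate cluster counts is the silhouette score. The following is a minimal sketch on a small made-up corpus and is not part of the original notebook: `toy_docs` and the `score_cluster_counts` helper are hypothetical names introduced for illustration. On the real data you would pass `text_vectors` and a wider range of counts.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

# Made-up documents for illustration only (not taken from the Scopus export)
toy_docs = [
    "credit scoring model for loan default risk",
    "credit risk assessment using machine learning",
    "loan default prediction and credit scoring",
    "fraud detection in card transactions",
    "anomaly detection for transaction fraud",
    "money laundering and fraud detection systems",
    "customer churn prediction in retail banking",
    "customer retention and churn modelling",
    "predicting customer churn with neural networks",
]

def score_cluster_counts(vectors, candidate_counts, random_state=42):
    """Return {k: silhouette score} for each candidate cluster count k."""
    scores = {}
    for k in candidate_counts:
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state)
        labels = model.fit_predict(vectors)
        scores[k] = silhouette_score(vectors, labels)
    return scores

toy_vectors = TfidfVectorizer(stop_words="english").fit_transform(toy_docs)
scores = score_cluster_counts(toy_vectors, candidate_counts=range(2, 6))
best_k = max(scores, key=scores.get)

for k, score in scores.items():
    print(f"k={k}: silhouette={score:.3f}")
print("Highest-scoring k on the toy corpus:", best_k)
```

A higher silhouette score suggests tighter, better-separated clusters, but it is only one signal; the cluster count that is most useful for a literature review may still depend on how the themes are defined.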

### Explanation of Key Steps

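1. **Concatenation**: Combines the `Title`, `Abstract`, `Author Keywords`, and `Index Keywords` columns into a single `Combined_Text` field, with missing values replaced by empty strings, so that clustering sees all available text for each article.
2. **TF-IDF Vectorization**: Converts each article to a numeric vector in which a term's weight increases with its frequency in that article and decreases with the number of articles that contain it. English stop words are dropped and the vocabulary is capped at 500 terms.
3. **KMeans Clustering**: Groups articles with similar TF-IDF vectors into `num_clusters` clusters (10 in the sample code).
4. **Cosine Similarity for Theme Assignment**: Compares the mean TF-IDF vector of each cluster with the TF-IDF vector of each theme's keyword string and assigns the closest theme to every article in that cluster.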
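
To make the theme-assignment step concrete, here is a small self-contained sketch of the same idea on made-up text. It is an illustration under assumptions, not the notebook's code: `cluster_docs`, `toy_themes`, and the shortened keyword lists are hypothetical stand-ins for one cluster and two of the themes.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up article texts standing in for the members of one cluster
cluster_docs = [
    "fraud detection with anomaly detection on card transactions",
    "detecting suspicious transactions and fraud in banking",
]

# Two themes with shortened, illustrative keyword lists
toy_themes = {
    "Fraud Detection": "fraud detection, anomaly detection, suspicious activities",
    "Churn Prediction": "churn prediction, customer retention, attrition",
}

# Fit the vocabulary on the articles, then project the themes into it
vectorizer = TfidfVectorizer(stop_words="english")
doc_vectors = vectorizer.fit_transform(cluster_docs)
toy_theme_vectors = vectorizer.transform(list(toy_themes.values()))

# Cluster centroid as a dense array of shape (1, n_features)
centroid = doc_vectors.toarray().mean(axis=0, keepdims=True)
scores = cosine_similarity(centroid, toy_theme_vectors)[0]

for name, score in zip(toy_themes, scores):
    print(f"{name}: {score:.3f}")

best_theme = list(toy_themes)[scores.argmax()]
print("Assigned theme:", best_theme)
```

In this sketch none of the churn keywords occur in the fitted vocabulary, so that theme scores exactly 0. The same effect applies to the full pipeline: theme keywords that fall outside the 500-term vocabulary contribute nothing to the similarity.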

### Usage

- Run this code in a Google Colab notebook or local Python environment.
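- Ensure the input file `scopus_198 - selected columns.csv` is uploaded to the working directory and contains the `Title`, `Abstract`, `Author Keywords`, and `Index Keywords` columns.
- The output file `categorized_research_articles.csv` will contain the `Title` and `Abstract` of each article together with a new `Category` column.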
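
Before running the pipeline, it can help to confirm that the export has the expected columns. The check below is a minimal sketch: `REQUIRED_COLUMNS` and `missing_columns` are hypothetical names introduced here, and the one-row frame is made up for illustration.

```python
import pandas as pd

# Columns that the sample code concatenates into Combined_Text
REQUIRED_COLUMNS = ["Title", "Abstract", "Author Keywords", "Index Keywords"]

def missing_columns(frame, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from the DataFrame."""
    return [col for col in required if col not in frame.columns]

# Made-up one-row frame that lacks the "Index Keywords" column
example = pd.DataFrame({"Title": ["A"], "Abstract": ["B"], "Author Keywords": ["C"]})
print(missing_columns(example))  # ['Index Keywords']
```

With the real data, calling `missing_columns(df)` right after `pd.read_csv(...)` should return an empty list.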

This approach provides an automated, data-driven first pass at categorizing articles through clustering and thematic similarity. Because every article inherits the theme assigned to its cluster, review the resulting categories and refine them with your domain knowledge.
