# Project Objective

This project aims to build a recommendation system for scientific articles using Doc2Vec and cosine similarity. The system will:

1. **Pre-process** a dataset of scientific articles, focusing on their abstracts.
2. **Train a Doc2Vec model** to generate vector representations of the articles.
3. **Develop a function** to recommend similar articles based on user input.
4. **Take user input** in the form of an article abstract.
5. **Generate recommendations** by calculating the cosine similarity between the vector inferred for the user's input and the vectors of the articles in the dataset.
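
As a rough illustration of the similarity measure used in step 5, the sketch below computes cosine similarity by hand. The helper name `cosine` and the three-dimensional vectors are hypothetical stand-ins for real document embeddings; the notebook itself relies on scikit-learn's `cosine_similarity`.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical vectors standing in for document embeddings.
query = [1.0, 2.0, 0.0]
same_direction = [2.0, 4.0, 0.0]  # parallel to query, different length
orthogonal = [0.0, 0.0, 3.0]      # shares no component with query

print(cosine(query, same_direction))  # approximately 1.0
print(cosine(query, orthogonal))      # 0.0
```

Because the measure depends only on direction, a vector and a scaled copy of it score close to 1, while vectors with no shared components score 0.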

In [None]:
# Import libraries
# pandas for data manipulation and analysis
import pandas as pd
# Doc2Vec model and the tagged-document wrapper it trains on
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
# Cosine similarity between vectors
from sklearn.metrics.pairwise import cosine_similarity
# Regular expressions for cleaning text
import re
# Word-level tokenizer
from nltk.tokenize import word_tokenize
# English stop-word list
from nltk.corpus import stopwords
# NLTK itself, needed to download the resources above
import nltk

In [None]:
# Download the NLTK resources used below
nltk.download('punkt')      # Punkt tokenizer models
nltk.download('punkt_tab')  # Punkt tables required by newer NLTK releases
nltk.download('stopwords')  # Stop-word lists

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
# Load Data
data = pd.read_csv('https://drive.google.com/uc?id=1pvcuGk2nRTsYcd-l-_yNBzvvRj2qW5rF&export=download', encoding='latin1')
data.head() # Display the first 5 rows of the dataframe

Unnamed: 0,paper_id,paper_title,author_keywords,abstract,area
0,1,Bayesian Nonparametric Inverse Reinforcement L...,"Markov Decision Processes, Bayesian Nonparamet...",In this paper we develop a Bayesian nonparamet...,Machine Learning I
1,2,State Abstraction in Reinforcement Learning by...,"Reinforcement learning, state abstraction, int...",Q-learning and other linear dynamic learning a...,Machine Learning I
2,3,A knowledge growth and consolidation framework...,"Lifelong machine learning, oblivion criterion,...",A more effective vision of machine learning sy...,Machine Learning I
3,4,LaCova: A Tree-Based Multi-Label Classifier us...,"Multi-label learning, Decision Trees, Covarian...",Dealing with multiple labels is a supervised l...,Machine Learning I
4,5,Combining Exact And Metaheuristic Techniques F...,"finite-state machines, constraint satisfaction...",This paper addresses the problem of learning e...,Machine Learning I


In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 90 entries, 0 to 89
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   paper_id         90 non-null     int64 
 1   paper_title      90 non-null     object
 2   author_keywords  90 non-null     object
 3   abstract         90 non-null     object
 4   area             90 non-null     object
dtypes: int64(1), object(4)
memory usage: 3.6+ KB


In [None]:
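# Pre-processing function
def pre_processing(text):
  # Lowercase so that stop-word matching is case-insensitive
  text = text.lower()
  # Replace each run of non-alphanumeric characters with a space.
  # An empty replacement would also delete the spaces and fuse the
  # whole abstract into a single token.
  text = re.sub(r'[^a-zA-Z0-9]+', ' ', text)
  tokens = word_tokenize(text)
  # Drop common English stop words
  stop_words = set(stopwords.words('english'))
  filtered_tokens = [word for word in tokens if word not in stop_words]
  return ' '.join(filtered_tokens)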
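
The replacement string passed to `re.sub` decides whether word boundaries survive the cleaning step. The sketch below contrasts an empty replacement with a single-space replacement; the sample sentence is made up for illustration and is not taken from the dataset.

```python
import re

# Hypothetical sample text, not taken from the dataset.
sample = "Semi-supervised clustering of spatio-temporal data."

# An empty replacement deletes spaces along with punctuation,
# so the whole text collapses into one long token.
collapsed = re.sub(r'[^a-z0-9]+', '', sample.lower())

# A single-space replacement keeps the words separate.
spaced = re.sub(r'[^a-z0-9]+', ' ', sample.lower()).strip()

print(collapsed)       # semisupervisedclusteringofspatiotemporaldata
print(spaced.split())  # ['semi', 'supervised', 'clustering', 'of', 'spatio', 'temporal', 'data']
```

With a collapsed string, each abstract reaches Doc2Vec as one unique token, so the model has no shared vocabulary from which to learn similarities between documents.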

In [None]:
# Apply pre-processing to the abstract column
data['processed_abstract'] = data['abstract'].fillna('').apply(pre_processing)

In [None]:
# Create TaggedDocuments for Doc2Vec
documents = [TaggedDocument(words=word_tokenize(abstract), tags=[str(i)]) for i, abstract in enumerate(data['processed_abstract'])]


In [None]:
# Build Doc2Vec model
# Initialize and train the Doc2Vec model
model = Doc2Vec(documents, vector_size=100, window=5, min_count=1, workers=4, epochs=20)

In [None]:
# Function to get recommendations
def get_recommendations(user_abstract, top_n=5):
  # Pre-process user abstract
  user_abstract = pre_processing(user_abstract)

  # Infer vector for user abstract
  user_vector = model.infer_vector(word_tokenize(user_abstract))

  # Calculate cosine similarity with all articles
  similarities = cosine_similarity([user_vector], [model.dv[str(i)] for i in range(len(data))])[0]

  # Get indices of top recommendations
  top_indices = similarities.argsort()[-top_n:][::-1]

  # Return top recommendations
  recommendations = data.iloc[top_indices]
  return recommendations
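
The expression `similarities.argsort()[-top_n:][::-1]` sorts the scores in ascending order, keeps the last `top_n` positions, and reverses them so the best match comes first. A minimal pure-Python sketch of the same ranking idea follows; the helper `top_n_indices` and the scores are hypothetical, and ties may be ordered differently than with NumPy's `argsort`.

```python
def top_n_indices(scores, n):
    """Return the indices of the n largest scores, best first."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:n]

# Hypothetical similarity scores for five articles.
scores = [0.12, 0.87, 0.45, 0.91, 0.30]

print(top_n_indices(scores, 3))  # [3, 1, 2]
```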

In [None]:
user_abstract = input('write abstract')
recommendations = get_recommendations(user_abstract)
recommendations

write abstractbiology


Unnamed: 0,paper_id,paper_title,author_keywords,abstract,area,processed_abstract
80,81,Speeding Learning of Personalized Audio Equali...,"personalized item, audio equalizer, transfer l...",Audio equalizers (EQs) are perhaps the most co...,Adaptive Data-Driven Modeling in Dynamic Envir...,audioequalizerseqsareperhapsthemostcommonlyuse...
87,88,Causal Discovery from Spatio-Temporal Data wit...,"spatio-temporal data, causal discovery, graphi...",Causal discovery algorithms have been used to ...,Machine learning of graphical models in static...,causaldiscoveryalgorithmshavebeenusedtoidentif...
15,16,Semi-Supervised Kernel-Based Temporal Clustering,"Kernel k-means, Semi-supervised clustering, Te...","In this paper, we adapt two existing methods t...",Semi-Supervised Learning,inthispaperweadapttwoexistingmethodstoperforms...
25,26,A Hybrid Genetic-Programming Swarm-Optimisatio...,"Sociology, Statistics, Noise, Testing, Predict...",Advances in high frequency trading in financia...,Real-time Systems and Industry,advancesinhighfrequencytradinginfinancialmarke...
58,59,A Genetic Algorithm Approach to Partitioning C...,"Partitioning Clustering, Clustering, Divisive ...",Acquiring a Master Degree is becoming a common...,Machine Learning Applications in Education I,acquiringamasterdegreeisbecomingacommonpractic...
