<a href="https://colab.research.google.com/github/seloooselin/citation-analysis-project/blob/main/notebooks/citation1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!git clone https://github.com/ScienceNLP-Lab/Citation-Integrity.git


fatal: destination path 'Citation-Integrity' already exists and is not an empty directory.


In [None]:
import json
import pandas as pd
import nltk
nltk.download('punkt')
from nltk.tokenize import sent_tokenize

# File paths
claims_file = "Citation-Integrity/Data/multivers-format/claims-test.jsonl"
corpus_file = "Citation-Integrity/Data/multivers-format/corpus.jsonl"

# Function to load JSONL file
def load_jsonl(file_path):
    data = []
    with open(file_path, 'r') as f:
        for line in f:
            data.append(json.loads(line))
    return data

# Load claims-test.jsonl and corpus.jsonl
claims_data = load_jsonl(claims_file)
corpus_data = load_jsonl(corpus_file)

# Convert corpus.jsonl to a dictionary {doc_id: abstract}
corpus_dict = {str(doc["doc_id"]): " ".join(doc["abstract"]) for doc in corpus_data}

# Display the first few corpus entries
list(corpus_dict.items())[:5]


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


[('66000', 'LncRNA loc285194 is a p53-regulated tumor suppressor'),
 ('66001',
  'Protein-coding genes account for only a small part of the human genome, whereas the vast majority of transcripts make up the non-coding RNAs including long non-coding RNAs (lncRNAs). Accumulating evidence indicates that lncRNAs could play a critical role in regulation of cellular processes such as cell growth and apoptosis as well as cancer progression and metastasis. LncRNA loc285194 was previously shown to be within a tumor suppressor unit in osteosarcoma and to suppress tumor cell growth. However, it is unknown regarding the regulation of loc285194. Moreover, the underlying mechanism by which loc285194 functions as a potential tumor suppressor is elusive. In this study, we show that loc285194 is a p53 transcription target; ectopic expression of loc285194 inhibits tumor cell growth both in vitro and in vivo. Through deletion analysis, we identify an active region responsible for tumor cell growth inhibi

In [None]:
# Research paper mapping of labels
label_mapping = {
    "ACCURATE": "ACCURATE",
    "INDIRECT": "ACCURATE",
    "CONTRADICT": "NOT_ACCURATE",
    "NOT_SUBSTANTIATE": "NOT_ACCURATE",
    "OVERSIMPLIFY": "NOT_ACCURATE",
    "MISQUOTE": "NOT_ACCURATE",
    "ETIQUETTE": "NOT_ACCURATE",
    "IRRELEVANT": "IRRELEVANT"
}
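
Any raw label missing from `label_mapping` falls through to "UNKNOWN" in the extraction function, so it can help to list uncovered labels before processing. The sketch below is one way to do that; the helper name `find_unmapped_labels` and the toy records are illustrative assumptions, and the record layout (`evidence` keyed by doc id, each holding a list of entries with a `label`) mirrors what the notebook iterates over.

```python
def find_unmapped_labels(claims, mapping):
    """Collect raw evidence labels that have no entry in `mapping`."""
    unmapped = set()
    for claim in claims:
        for entries in claim.get("evidence", {}).values():
            for entry in entries:
                label = entry.get("label")
                if label is not None and label not in mapping:
                    unmapped.add(label)
    return unmapped

# Toy records (hypothetical), shaped like the entries in the claims file
toy_mapping = {"ACCURATE": "ACCURATE", "CONTRADICT": "NOT_ACCURATE"}
toy_claims = [
    {"claim": "A", "evidence": {"1": [{"label": "ACCURATE", "sentences": [0]}]}},
    {"claim": "B", "evidence": {"2": [{"label": "INDIRECT_NOT_REVIEW", "sentences": [1]}]}},
    {"claim": "C", "evidence": {}},  # claims without evidence are simply skipped
]

print(find_unmapped_labels(toy_claims, toy_mapping))
```

Running the same check on `claims_data` with the real `label_mapping` would show whether any label still needs a mapping entry.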


In [None]:
# Function to extract evidence text with proper sentence splitting
def extract_evidence_text(evidence):
    extracted_text = []
    extracted_labels = []
    doc_id_found = False  # Track if we found at least one matching doc_id

    for doc_id, details in evidence.items():
        doc_id = str(doc_id)  # Convert to string for dictionary lookup
        if doc_id in corpus_dict:  # Check if document exists in corpus
            doc_id_found = True
            full_text = sent_tokenize(corpus_dict[doc_id])  # Proper sentence tokenization

            for entry in details:
                if "label" in entry and "sentences" in entry:
                    mapped_label = label_mapping.get(entry["label"], "UNKNOWN")
                    extracted_labels.append(mapped_label)

                    # Extract the relevant sentences
                    for sent_id in entry["sentences"]:
                        if sent_id < len(full_text):  # Ensure index is valid
                            extracted_text.append(full_text[sent_id])
                        else:
                            print(f"Sentence index {sent_id} out of range for doc {doc_id}")

    # If no valid evidence was found, return "NO EVIDENCE" and "IRRELEVANT"
    if not extracted_text:
        if not doc_id_found:
            print(f"No matching document found for cited docs: {list(evidence.keys())}")
        return "NO EVIDENCE", "IRRELEVANT"

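    # Use the most common label; on ties, pick the label that appeared first.
    # (Iterating a set here would make tie-breaking depend on string hashing.)
    final_label = max(dict.fromkeys(extracted_labels), key=extracted_labels.count)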
    return " ".join(extracted_text), final_label


In [None]:
label_mapping.update({
    "INDIRECT_NOT_REVIEW": "IRRELEVANT"
})


In [None]:
# Process claims and extract labeled evidence
processed_data = []
for claim in claims_data:
    evidence_text, final_label = extract_evidence_text(claim["evidence"])
    processed_data.append({"text": claim["claim"] + " [SEP] " + evidence_text, "label": final_label})

# Convert to DataFrame
df = pd.DataFrame(processed_data)

# Save dataset for later use
df.to_csv("citation_classification_dataset.csv", index=False)

# Display dataset
df.head()


No matching document found for cited docs: []
No matching document found for cited docs: []
No matching document found for cited docs: []
No matching document found for cited docs: []
No matching document found for cited docs: []
No matching document found for cited docs: []
No matching document found for cited docs: []
No matching document found for cited docs: []
No matching document found for cited docs: []
No matching document found for cited docs: []
Sentence index 3 out of range for doc 36006
No matching document found for cited docs: []
Sentence index 3 out of range for doc 36006
No matching document found for cited docs: []
Sentence index 4 out of range for doc 36020
Sentence index 6 out of range for doc 36011
Sentence index 7 out of range for doc 36011
Sentence index 7 out of range for doc 36011
Sentence index 8 out of range for doc 36011
Sentence index 4 out of range for doc 36020
No matching document found for cited docs: []
Sentence index 3 out of range for doc 36015
Senten

   text                                               label
0  FMO3 and TMAO have emerged as key components o...  NOT_ACCURATE
1  In apoliprotein E-deficient mice fed a diet wi...  NOT_ACCURATE
2  Dietary L-carnitine and choline, compounds abu...  ACCURATE
3  While higher plasma levels of -carnitine, in ...   ACCURATE
4  TMAO could be derived from increased consumpti...  ACCURATE
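
The "Sentence index ... out of range" messages in the output above are consistent with the evidence indices not lining up with the sentences NLTK produces after the abstract is joined and re-split. One plausible fix, assuming the `sentences` indices refer to positions in each corpus record's own `abstract` list (the list that `" ".join(...)` flattens), is to index that list directly. The helper `select_sentences` and the toy abstract are hypothetical.

```python
def select_sentences(abstract_sentences, sentence_ids):
    """Pick sentences by position from the pre-split abstract, skipping invalid indices."""
    return [
        abstract_sentences[i]
        for i in sentence_ids
        if 0 <= i < len(abstract_sentences)
    ]

# Toy abstract (hypothetical): already split into sentences, as in the corpus records
toy_abstract = [
    "Background sentence.",
    "Methods sentence with an abbreviation, e.g. one that can confuse a re-tokenizer.",
    "Result sentence.",
]

# Index 7 does not exist in the toy abstract and is skipped
picked = select_sentences(toy_abstract, [0, 2, 7])
print(" ".join(picked))
```

With this approach the corpus dictionary would map each doc id to the sentence list itself rather than to the joined string, and no tokenizer is needed for evidence lookup.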


In [None]:
df["label"].value_counts()


label         count
ACCURATE        375
IRRELEVANT      132
NOT_ACCURATE     99
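
These counts are imbalanced, which sets a floor for judging accuracy later: a classifier that always predicts ACCURATE would already be right on about 62% of the rows. The arithmetic below uses only the counts printed above.

```python
# Class counts copied from the value_counts() output above
counts = {"ACCURATE": 375, "IRRELEVANT": 132, "NOT_ACCURATE": 99}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}
majority_label = max(counts, key=counts.get)

for label, share in shares.items():
    print(f"{label:<13}{share:.3f}")
print("Majority-class baseline accuracy:", round(shares[majority_label], 3))
```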


In [None]:
# Save the cleaned dataset as a CSV
df.to_csv("citation_classification_dataset.csv", index=False)

# Download the file
from google.colab import files
files.download("citation_classification_dataset.csv")



In [None]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report


In [None]:
df = pd.read_csv("citation_classification_dataset.csv")
df.head()


   text                                               label
0  FMO3 and TMAO have emerged as key components o...  NOT_ACCURATE
1  In apoliprotein E-deficient mice fed a diet wi...  NOT_ACCURATE
2  Dietary L-carnitine and choline, compounds abu...  ACCURATE
3  While higher plasma levels of -carnitine, in ...   ACCURATE
4  TMAO could be derived from increased consumpti...  ACCURATE


In [None]:
# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(df["text"], df["label"], test_size=0.2, stratify=df["label"], random_state=42)

# Convert text to TF-IDF features
vectorizer = TfidfVectorizer(max_features=5000, stop_words="english")
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

# Check feature shape
X_train_tfidf.shape, X_test_tfidf.shape


((484, 3991), (122, 3991))
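
The feature count is 3991 rather than 5000 because `max_features` is only an upper bound: the vectorizer keeps the most frequent terms up to that limit, and here the training vocabulary (after stop-word removal) is smaller than the limit. The stdlib sketch below illustrates that capping rule only; it is not scikit-learn's implementation, and the whitespace tokenization and the name `capped_vocabulary` are simplifying assumptions.

```python
from collections import Counter

def capped_vocabulary(docs, max_features):
    """Keep at most `max_features` terms, ranked by total count across `docs`."""
    term_counts = Counter(token for doc in docs for token in doc.lower().split())
    return [term for term, _ in term_counts.most_common(max_features)]

# Toy documents (hypothetical) with 5 distinct terms in total
toy_docs = ["tumor growth suppressor", "tumor cell growth", "cell apoptosis"]

print(len(capped_vocabulary(toy_docs, max_features=3)))     # cap applies
print(len(capped_vocabulary(toy_docs, max_features=5000)))  # cap inactive, whole vocabulary kept
```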

In [None]:
# Train a Logistic Regression model
model = LogisticRegression(max_iter=1000)
model.fit(X_train_tfidf, y_train)

# Predict on test data
y_pred = model.predict(X_test_tfidf)

# Evaluate performance
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))


Accuracy: 0.7704918032786885
              precision    recall  f1-score   support

    ACCURATE       0.73      0.99      0.84        75
  IRRELEVANT       0.95      0.67      0.78        27
NOT_ACCURATE       1.00      0.10      0.18        20

    accuracy                           0.77       122
   macro avg       0.89      0.58      0.60       122
weighted avg       0.82      0.77      0.72       122
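
The headline accuracy hides how few NOT_ACCURATE examples are recovered. The sketch below reconstructs the per-class correct counts from the report above by rounding recall × support to the nearest integer; those counts are an inference from the rounded report, not values the notebook printed.

```python
# Per-class recall and support copied from the report above
report = {
    "ACCURATE": {"recall": 0.99, "support": 75},
    "IRRELEVANT": {"recall": 0.67, "support": 27},
    "NOT_ACCURATE": {"recall": 0.10, "support": 20},
}

# Inferred number of correctly classified test examples per class
correct = {label: round(r["recall"] * r["support"]) for label, r in report.items()}
total = sum(r["support"] for r in report.values())

# Accuracy weights every example equally; macro recall weights every class equally
accuracy = sum(correct.values()) / total
macro_recall = sum(correct[label] / report[label]["support"] for label in report) / len(report)

print(correct)
print(f"accuracy={accuracy:.4f}, macro recall={macro_recall:.2f}")
```

The inferred counts reproduce the printed accuracy, which suggests that only about 2 of the 20 NOT_ACCURATE test examples were classified correctly.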



Retrying with random oversampling of the training set to improve the NOT_ACCURATE scores (recall was only 0.10 above)

In [None]:
from imblearn.over_sampling import RandomOverSampler

# Oversample every non-majority class (IRRELEVANT and NOT_ACCURATE) up to the majority class size
oversampler = RandomOverSampler(sampling_strategy="not majority", random_state=42)
X_resampled, y_resampled = oversampler.fit_resample(X_train_tfidf, y_train)

# Train the model again
model = LogisticRegression(max_iter=1000)
model.fit(X_resampled, y_resampled)

# Predict on test set
y_pred_resampled = model.predict(X_test_tfidf)

# Evaluate again
print("Accuracy:", accuracy_score(y_test, y_pred_resampled))
print(classification_report(y_test, y_pred_resampled))


Accuracy: 0.7622950819672131
              precision    recall  f1-score   support

    ACCURATE       0.80      0.85      0.83        75
  IRRELEVANT       0.85      0.85      0.85        27
NOT_ACCURATE       0.40      0.30      0.34        20

    accuracy                           0.76       122
   macro avg       0.68      0.67      0.67       122
weighted avg       0.75      0.76      0.75       122
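
Random oversampling duplicates rows from the smaller classes, drawn with replacement, until every class matches the largest one. It is applied to the training split only, so the test set keeps its original class distribution. The stdlib sketch below illustrates the idea; it is not imbalanced-learn's implementation, and `random_oversample` and the toy data are hypothetical.

```python
import random
from collections import Counter

def random_oversample(rows, labels, seed=42):
    """Duplicate randomly chosen rows of smaller classes until all classes match the largest."""
    rng = random.Random(seed)
    by_label = {}
    for row, label in zip(rows, labels):
        by_label.setdefault(label, []).append(row)
    target = max(len(group) for group in by_label.values())

    out_rows, out_labels = [], []
    for label, group in by_label.items():
        # Draw the missing rows with replacement from the class's own rows
        extra = [rng.choice(group) for _ in range(target - len(group))]
        out_rows.extend(group + extra)
        out_labels.extend([label] * target)
    return out_rows, out_labels

# Toy training set (hypothetical) with the same kind of imbalance
toy_labels = ["ACCURATE"] * 6 + ["IRRELEVANT"] * 3 + ["NOT_ACCURATE"] * 2
toy_rows = [f"text {i}" for i in range(len(toy_labels))]

balanced_rows, balanced_labels = random_oversample(toy_rows, toy_labels)
print(Counter(balanced_labels))
```

Duplicated rows add no new information, which fits the report above: NOT_ACCURATE recall rises from 0.10 to 0.30 while its precision falls from 1.00 to 0.40.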



Random forest on the same oversampled training data

In [None]:
from sklearn.ensemble import RandomForestClassifier

# Train a Random Forest model
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_resampled, y_resampled)

# Predict on test data
y_pred_rf = rf_model.predict(X_test_tfidf)

# Evaluate performance
print("Random Forest Accuracy:", accuracy_score(y_test, y_pred_rf))
print(classification_report(y_test, y_pred_rf))


Random Forest Accuracy: 0.819672131147541
              precision    recall  f1-score   support

    ACCURATE       0.80      0.95      0.87        75
  IRRELEVANT       0.93      0.93      0.93        27
NOT_ACCURATE       0.67      0.20      0.31        20

    accuracy                           0.82       122
   macro avg       0.80      0.69      0.70       122
weighted avg       0.80      0.82      0.79       122



In [None]:
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier

# Convert labels to numeric values
label_encoder = LabelEncoder()
y_train_encoded = label_encoder.fit_transform(y_resampled)  # Train labels
y_test_encoded = label_encoder.transform(y_test)  # Test labels

# Train XGBoost model
xgb_model = XGBClassifier(n_estimators=300, random_state=42)
xgb_model.fit(X_resampled, y_train_encoded)

# Predict on test set
y_pred_xgb = xgb_model.predict(X_test_tfidf)

# Convert predictions back to text labels
y_pred_xgb_labels = label_encoder.inverse_transform(y_pred_xgb)

# Evaluate performance
print("XGBoost Accuracy:", accuracy_score(y_test, y_pred_xgb_labels))
print(classification_report(y_test, y_pred_xgb_labels))



XGBoost Accuracy: 0.7868852459016393
              precision    recall  f1-score   support

    ACCURATE       0.81      0.89      0.85        75
  IRRELEVANT       0.87      0.96      0.91        27
NOT_ACCURATE       0.33      0.15      0.21        20

    accuracy                           0.79       122
   macro avg       0.67      0.67      0.66       122
weighted avg       0.74      0.79      0.76       122
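
To compare the four runs side by side, the sketch below tabulates accuracy, macro F1, and NOT_ACCURATE F1 as printed in the reports above. The short model names are labels chosen here for readability.

```python
# (accuracy, macro F1, NOT_ACCURATE F1) copied from the four reports above
results = {
    "LogReg": (0.77, 0.60, 0.18),
    "LogReg + oversampling": (0.76, 0.67, 0.34),
    "Random Forest + oversampling": (0.82, 0.70, 0.31),
    "XGBoost + oversampling": (0.79, 0.66, 0.21),
}

# Print the runs ordered by macro F1, highest first
print(f"{'model':<30}{'acc':>6}{'macroF1':>9}{'NOT_ACC F1':>12}")
for name, (acc, macro_f1, minority_f1) in sorted(
    results.items(), key=lambda kv: kv[1][1], reverse=True
):
    print(f"{name:<30}{acc:>6.2f}{macro_f1:>9.2f}{minority_f1:>12.2f}")

best_macro = max(results, key=lambda name: results[name][1])
best_minority = max(results, key=lambda name: results[name][2])
print("Best macro F1:", best_macro)
print("Best NOT_ACCURATE F1:", best_minority)
```

All four runs are scored on the same 122 test rows, only 20 of which are NOT_ACCURATE, so a difference of one or two examples is enough to move the minority-class scores noticeably.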

