# Line Continuum Classifier

This notebook imports our pre-processed dataset of past ALMA projects, which includes a target variable indicating whether each project is classified as a line (1) or continuum (0) project. The project title and abstract are combined, then cleaned to remove unnecessary characters and normalize formatting. The combined text is vectorized with *sklearn*'s *TfidfVectorizer* and used to train a logistic regression model. The fitted vectorizer and model are then saved with *joblib* for future use with other data.

#### Capstone Group:

Arnav Boppudi

Ryan Lipps

Noah McIntire

Kaleigh O'hara

Brendan Puglisi

## Importing Dependencies


Below are the imports this notebook needs in order to run:
* **pandas** - dataframe datatype and modification
* **numpy** - numerical functions and datatypes, used alongside pandas
* **re and string** - used for text processing
* **joblib** - used to save and reload the fitted vectorizer and model
* **sklearn** - used for data splitting, text vectorization, model building, and model statistics


In [1]:
import pandas as pd
import numpy as np
import re, string  # for text pre-processing
from joblib import dump, load
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, f1_score, accuracy_score, confusion_matrix, roc_curve, auc
from sklearn.feature_extraction.text import TfidfVectorizer

## Reading Data, Formatting
In order for this project to work with other data, only three columns are necessary:
* Title (project_title) - The title of the project (predictor variable).
* Abstract (project_abstract) - The project abstract (predictor variable).
* Type (fs_type and target) - The type of project conducted, line or continuum (target variable).

Below, the project title and abstract are combined into a single column so that they act as one predictor variable and can be processed and vectorized together. The "Type" column used here is *target*, a numerical representation of the class each project falls into.

In [2]:
df = pd.read_csv('../data/nrao_projects.csv')
#df['target'] = pd.get_dummies(df['fs_type'], dtype=int)['line']
df['text'] = df.project_title + ". " + df.project_abstract
df['text'] = df['text'].astype(str)
df.head()

Unnamed: 0,project_code,project_title,project_abstract,fs_type,target,text
0,2018.1.01205.L,Fifty AU STudy of the chemistry in the disk/en...,The huge variety of planetary systems discover...,line,1,Fifty AU STudy of the chemistry in the disk/en...
1,2022.1.00316.L,COMPASS: Complex Organic Molecules in Protosta...,The emergence of complex organic molecules in ...,line,1,COMPASS: Complex Organic Molecules in Protosta...
2,2017.1.00161.L,ALCHEMI: the ALMA Comprehensive High-resolutio...,A great variety in gas composition is observed...,line,1,ALCHEMI: the ALMA Comprehensive High-resolutio...
3,2021.1.01616.L,ALMA JELLY - Survey of Nearby Jellyfish and Ra...,We propose the first ever statistical survey o...,line,1,ALMA JELLY - Survey of Nearby Jellyfish and Ra...
4,2021.1.00869.L,Bulge symmetry or not? The hidden dynamics of ...,A radio survey of red giant SiO sources in the...,line,1,Bulge symmetry or not? The hidden dynamics of ...
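
For a new dataset that only carries the *fs_type* label, the numeric *target* can be derived from it. The sketch below is one way to do that, assuming *fs_type* holds exactly the strings `line` and `continuum` as in the preview above. The two-row frame is made up for illustration, and filling missing values before concatenation is an added precaution rather than something this notebook does.

```python
import pandas as pd

# Hypothetical two-row frame with the three required columns.
toy = pd.DataFrame({
    "project_title": ["CO survey of nearby disks", "Dust emission in AGN"],
    "project_abstract": ["We map CO(2-1) emission.", None],
    "fs_type": ["line", "continuum"],
})

# 1 for line projects, 0 for continuum projects.
toy["target"] = (toy["fs_type"] == "line").astype(int)

# Fill missing values first so that a missing abstract does not turn
# the whole combined text into a missing value.
toy["text"] = toy["project_title"].fillna("") + ". " + toy["project_abstract"].fillna("")

print(toy[["fs_type", "target", "text"]])
```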


In [3]:
x = df['target'].value_counts()
print(x)

target
1    3628
0     900
Name: count, dtype: int64
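
The counts above show that the classes are imbalanced. As a rough reference point for the accuracy reported later, the short calculation below works out what share of projects are line projects, which is also the accuracy a classifier would reach by always predicting "line". It uses only the two counts printed above.

```python
# Class counts reported by value_counts() above.
n_line, n_continuum = 3628, 900
total = n_line + n_continuum

# Share of line projects = accuracy of an "always predict line" baseline.
line_share = n_line / total
print(f"total projects: {total}, line share: {line_share:.3f}")
```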


## Text Preprocessing

Below, the preprocessing function does several things to the *text* column within the data. This includes:
1. Lowercasing every character in the string.
2. Removing leading and trailing whitespace.
3. Using regular expressions to strip HTML-style tags, replace punctuation and digits with spaces, and collapse repeated whitespace.

The processed text is then added to the dataframe as the *clean_text* column.

In [4]:
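# Convert to lowercase, strip, and remove tags, punctuation, and digits
def preprocess(text):
    text = text.lower()  # lowercase every character
    text = text.strip()  # drop leading/trailing whitespace
    text = re.compile('<.*?>').sub('', text)  # remove HTML-style tags
    text = re.compile('[%s]' % re.escape(string.punctuation)).sub(' ', text)  # ASCII punctuation -> space
    text = re.sub(r'\s+', ' ', text)  # collapse runs of whitespace
    text = re.sub(r'\[[0-9]*\]', ' ', text)  # bracketed reference numbers (brackets are already gone by now)
    text = re.sub(r'[^\w\s]', '', str(text).lower().strip())  # remove any remaining non-word characters
    text = re.sub(r'\d', ' ', text)  # digits -> space
    text = re.sub(r'\s+', ' ', text)  # collapse whitespace again
    return text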
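
To make the effect of these steps concrete, the sketch below applies a compact restatement of the same sequence of substitutions to a single made-up string. The sample input is hypothetical and chosen only to exercise the tag, punctuation, and digit handling.

```python
import re
import string

def preprocess(text):
    # Same sequence of steps as the notebook's function, written compactly.
    text = text.lower().strip()
    text = re.sub(r'<.*?>', '', text)                                   # HTML-style tags
    text = re.sub('[%s]' % re.escape(string.punctuation), ' ', text)   # ASCII punctuation
    text = re.sub(r'\s+', ' ', text)
    text = re.sub(r'\[[0-9]*\]', ' ', text)
    text = re.sub(r'[^\w\s]', '', text.strip())                        # leftover symbols
    text = re.sub(r'\d', ' ', text)                                    # digits
    text = re.sub(r'\s+', ' ', text)
    return text

sample = "  ALMA <b>CO(2-1)</b> Survey: 12 disks!  "  # made-up input
cleaned = preprocess(sample)
print(repr(cleaned))  # → 'alma co survey disks'
```

Note that transition labels such as `(2-1)` are lost entirely, since both the punctuation and the digits are replaced with spaces.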

In [5]:
df['clean_text'] = df['text'].apply(preprocess)
df.head()

Unnamed: 0,project_code,project_title,project_abstract,fs_type,target,text,clean_text
0,2018.1.01205.L,Fifty AU STudy of the chemistry in the disk/en...,The huge variety of planetary systems discover...,line,1,Fifty AU STudy of the chemistry in the disk/en...,fifty au study of the chemistry in the disk en...
1,2022.1.00316.L,COMPASS: Complex Organic Molecules in Protosta...,The emergence of complex organic molecules in ...,line,1,COMPASS: Complex Organic Molecules in Protosta...,compass complex organic molecules in protostar...
2,2017.1.00161.L,ALCHEMI: the ALMA Comprehensive High-resolutio...,A great variety in gas composition is observed...,line,1,ALCHEMI: the ALMA Comprehensive High-resolutio...,alchemi the alma comprehensive high resolution...
3,2021.1.01616.L,ALMA JELLY - Survey of Nearby Jellyfish and Ra...,We propose the first ever statistical survey o...,line,1,ALMA JELLY - Survey of Nearby Jellyfish and Ra...,alma jelly survey of nearby jellyfish and ram ...
4,2021.1.00869.L,Bulge symmetry or not? The hidden dynamics of ...,A radio survey of red giant SiO sources in the...,line,1,Bulge symmetry or not? The hidden dynamics of ...,bulge symmetry or not the hidden dynamics of t...


## Extracting Vectors
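Below, we use the *train_test_split* function from *sklearn* to split our data into a training set of 3628 projects and a test set of 900 projects. Because *test_size* is given as an integer, it is read as an absolute number of rows rather than a fraction.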

In [6]:
#SPLITTING THE TRAINING DATASET INTO TRAIN AND TEST
X_train, X_test, y_train, y_test = train_test_split(df["clean_text"],
                                                    df["target"],
                                                    test_size=900,
                                                    shuffle=True)
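
The split above is random and unseeded, so the exact rows in each set (and the metrics further down) will vary between runs. The sketch below shows one possible variation, not what this notebook does: passing `stratify` keeps the line/continuum ratio the same in both sets, and `random_state` makes the split repeatable. The toy labels and the seed value 42 are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels with roughly the same 80/20 imbalance as the real data (hypothetical).
y = np.array([1] * 80 + [0] * 20)
X = np.arange(len(y))  # stand-in for the clean_text column

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y,
    test_size=20,      # absolute number of test rows
    shuffle=True,
    stratify=y,        # keep the class ratio the same in both splits
    random_state=42,   # arbitrary seed so the split is repeatable
)

print(len(y_tr), len(y_te), y_te.mean())
```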

Below, we use *sklearn*'s *TfidfVectorizer* to turn each combined project title and abstract into a numerical vector. The vectorizer is fit on the training data only and then applied to the test data, so both sets share the same vocabulary and weights. This numerical representation of the data can then be used to train and test a machine learning model.

TF-IDF (term frequency-inverse document frequency) is a metric for representing word usage numerically across a corpus of text. It weighs how often a word appears within a specific document (in our case a project title and abstract) against how many documents in the entire corpus (all projects included in this dataset) contain that word, so that words common to most projects receive less weight.

In [7]:
#Tf-Idf vectorization
tfidf_vectorizer = TfidfVectorizer(use_idf=True)
X_train_vectors_tfidf = tfidf_vectorizer.fit_transform(X_train) 
X_test_vectors_tfidf = tfidf_vectorizer.transform(X_test)
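
To illustrate the weighting, the sketch below fits a *TfidfVectorizer* on three made-up sentences and compares the inverse document frequency of a word that appears in every document with one that appears in only one. With scikit-learn's default smoothing, the IDF of a term is ln((1 + n) / (1 + df)) + 1, where n is the number of documents and df is the number of documents containing the term. The corpus is hypothetical and far smaller than the real data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three made-up "documents"; 'gas' appears in all of them, 'dust' in only one.
docs = [
    "molecular gas in disks",
    "dense gas and dust",
    "gas kinematics in galaxies",
]

vec = TfidfVectorizer(use_idf=True)
X = vec.fit_transform(docs)

# idf_ is indexed by the column positions stored in vocabulary_.
idf = {term: vec.idf_[col] for term, col in vec.vocabulary_.items()}
print(f"idf(gas)  = {idf['gas']:.3f}")   # appears everywhere, lowest weight
print(f"idf(dust) = {idf['dust']:.3f}")  # appears once, higher weight
```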

## Fitting the Model and Summary Statistics

Below, we train and test a logistic regression model on the vectorized data. The output below the code chunk reports precision, recall, F1-score, and accuracy on the test set, followed by the confusion matrix and the area under the ROC curve (AUC). In the confusion matrix, rows are true classes and columns are predicted classes, so true negatives are in the upper left corner, false positives in the upper right corner, false negatives in the lower left corner, and true positives in the lower right corner.

In [8]:
#FITTING THE CLASSIFICATION MODEL using Logistic Regression(tf-idf)
log_reg_model = LogisticRegression(solver='liblinear', C=10, penalty='l2')
log_reg_model.fit(X_train_vectors_tfidf, y_train)  

#Predict y value for test dataset
y_predict = log_reg_model.predict(X_test_vectors_tfidf)
y_prob = log_reg_model.predict_proba(X_test_vectors_tfidf)[:,1]

print(classification_report(y_test,y_predict))
print('Confusion Matrix:',confusion_matrix(y_test, y_predict))
 
fpr, tpr, thresholds = roc_curve(y_test, y_prob)
roc_auc = auc(fpr, tpr)
print('AUC:', roc_auc)

              precision    recall  f1-score   support

           0       0.80      0.59      0.68       175
           1       0.91      0.96      0.94       725

    accuracy                           0.89       900
   macro avg       0.85      0.78      0.81       900
weighted avg       0.89      0.89      0.89       900

Confusion Matrix: [[104  71]
 [ 26 699]]
AUC: 0.900248275862069
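
The figures in the classification report can be recovered directly from the confusion matrix. The sketch below does this for the matrix printed above, relying on scikit-learn's convention that rows are true classes and columns are predicted classes. The variable names are chosen for illustration.

```python
import numpy as np

# Confusion matrix reported above: rows are true classes, columns are predictions.
cm = np.array([[104, 71],
               [26, 699]])

# For a 2x2 matrix, ravel() yields TN, FP, FN, TP in that order.
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
continuum_recall = tn / (tn + fp)  # share of continuum projects correctly identified
line_precision = tp / (tp + fp)    # share of "line" predictions that are correct

print(f"accuracy={accuracy:.2f}, "
      f"continuum recall={continuum_recall:.2f}, "
      f"line precision={line_precision:.2f}")
```

The low continuum recall reflects the 71 continuum projects that the model labeled as line projects.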


## Attach Model Probabilities to Test Data Frame

Pairing each test project with its predicted probability of being a line project makes it possible to inspect the model's predictions alongside the original project information.

In [12]:
# Attach model probabilities to test data frame
test_prob_df = df.loc[X_test.index].copy()  # copy so the new column is added to an independent frame
test_prob_df['line_prob'] = y_prob
test_prob_df.sort_values('line_prob', ascending=False)

Unnamed: 0,project_code,project_title,project_abstract,fs_type,target,text,clean_text,line_prob
889,2016.1.00065.S,Probing the dense gas with HCN(5-4) in four SP...,The prodigious star formation rates observed i...,line,1,Probing the dense gas with HCN(5-4) in four SP...,probing the dense gas with hcn in four spt len...,0.999601
575,2013.1.00911.S,Molecular gas conditions and shocks in the sup...,We propose multi-line observations of the star...,line,1,Molecular gas conditions and shocks in the sup...,molecular gas conditions and shocks in the sup...,0.999470
1869,2016.1.00951.S,Filaments and Massive Star Formation,The filamentary nature of the interstellar med...,line,1,Filaments and Massive Star Formation. The fila...,filaments and massive star formation the filam...,0.999424
690,2015.1.01448.S,High Resolution Observations of the Dense Gas ...,We propose high spatial resolution (0.12''~60p...,line,1,High Resolution Observations of the Dense Gas ...,high resolution observations of the dense gas ...,0.999307
175,2013.1.00212.S,Detailed molecular gas distribution of an acti...,The aim of this observation is to reveal for t...,line,1,Detailed molecular gas distribution of an acti...,detailed molecular gas distribution of an acti...,0.999273
...,...,...,...,...,...,...,...,...
1282,2016.1.00115.S,Disentangle the polarization mechanims between...,We propose to observe the polarization of the ...,continuum,0,Disentangle the polarization mechanims between...,disentangle the polarization mechanims between...,0.028194
1270,2016.1.00201.S,Magnetohydrodynamic mechanisms of jets in the ...,Solar spicules are one of the jet phenomena re...,continuum,0,Magnetohydrodynamic mechanisms of jets in the ...,magnetohydrodynamic mechanisms of jets in the ...,0.026461
1630,2019.1.00580.S,Search for time variability in coronal syncrot...,"AGNs host hot plasma, namely coronae, emitting...",continuum,0,Search for time variability in coronal syncrot...,search for time variability in coronal syncrot...,0.021795
1269,2016.1.00202.S,Dynamics and energetics of the quiet-sun solar...,This proposal seeks to test models for heating...,continuum,0,Dynamics and energetics of the quiet-sun solar...,dynamics and energetics of the quiet sun solar...,0.017933
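
One possible use of this table is to pull out the projects the model got wrong. The sketch below does so on a small made-up frame with the same *target* and *line_prob* columns. The project codes and probabilities are hypothetical, and the 0.5 cut is assumed to match what `predict` does for a binary logistic regression.

```python
import pandas as pd

# Hypothetical frame shaped like test_prob_df (values are made up).
toy = pd.DataFrame({
    "project_code": ["A", "B", "C", "D"],
    "target": [1, 0, 0, 1],
    "line_prob": [0.97, 0.64, 0.08, 0.31],
})

# Turn probabilities into class labels with a 0.5 cut.
toy["predicted"] = (toy["line_prob"] > 0.5).astype(int)

# Keep only rows where the predicted label disagrees with the true label.
misclassified = toy[toy["predicted"] != toy["target"]]
print(misclassified.sort_values("line_prob", ascending=False))
```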


## Save Models

Once the TF-IDF vectorizer and logistic regression model are trained, we save them with joblib so that future proposals are evaluated with the same fitted objects, rather than re-training a model every time this notebook is run. An example of re-loading the vectorizer and logistic regression model can be found in model_import_tes.ipynb.

In [10]:
dump(tfidf_vectorizer, 'tfidf_vectorizer.joblib')
dump(log_reg_model, 'log_model.joblib')

['log_model.joblib']
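
As a rough, self-contained illustration of the save-and-reload round trip, the sketch below trains a tiny model on made-up text, writes both objects to a temporary directory, loads them back, and scores a new string. The corpus, labels, and temporary paths are assumptions for the example. With the real files, new text would first need the same preprocessing used in this notebook.

```python
import os
import tempfile

from joblib import dump, load
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny made-up corpus so the example runs on its own.
texts = ["co line emission survey", "dust continuum imaging",
         "molecular line mapping", "continuum polarization of dust"]
labels = [1, 0, 1, 0]

vectorizer = TfidfVectorizer()
model = LogisticRegression(solver="liblinear", C=10, penalty="l2")
model.fit(vectorizer.fit_transform(texts), labels)

with tempfile.TemporaryDirectory() as tmp:
    vec_path = os.path.join(tmp, "tfidf_vectorizer.joblib")
    model_path = os.path.join(tmp, "log_model.joblib")
    dump(vectorizer, vec_path)
    dump(model, model_path)

    # Reload both fitted objects from disk.
    loaded_vec = load(vec_path)
    loaded_model = load(model_path)

# Score new text with transform(), not fit_transform(), so the saved
# vocabulary and IDF weights are reused unchanged.
new_text = ["line survey of molecular gas"]
line_prob = loaded_model.predict_proba(loaded_vec.transform(new_text))[:, 1]
print(line_prob)
```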