# Line Continuum Classifier

This notebook imports our pre-processed dataset of past ALMA projects, which includes a target variable indicating whether each project is classified as a line (1) or continuum (0) project. The project title and abstract are combined, then cleaned to remove unnecessary characters and normalize formatting. The combined text is vectorized using *sklearn.feature_extraction.text.TfidfVectorizer* and used to train a logistic regression model. The fitted vectorizer and model are then saved using *joblib* for future use with other data.

This notebook trains on the full data for use in production.

#### Capstone Group:

Arnav Boppudi

Ryan Lipps

Noah McIntire

Kaleigh O'hara

Brendan Puglisi

## Importing Dependencies


Below are the imports this notebook needs in order to run:
* **pandas** - dataframe datatype and modification
* **numpy** - numerical functions and datatypes, used alongside pandas
* **re and string** - used for text processing
* **joblib** - used for saving and loading the fitted vectorizer and model
* **sklearn** - used for data splitting, text vectorization, model building, and model statistics


In [1]:
import pandas as pd
import numpy as np
import re, string  # for text pre-processing
from joblib import dump, load
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, f1_score, accuracy_score, confusion_matrix, roc_curve, auc
from sklearn.feature_extraction.text import TfidfVectorizer

## Reading Data, Formatting
For this project to work with other data, only three columns are necessary:
* Title (project_title) - The title of the project (predictor variable).
* Abstract (project_abstract) - The project abstract (predictor variable).
* Type (fs_type and target) - The type of project conducted, line or continuum (target variable).

Below, the project title and abstract are combined into a single column that acts as one predictor variable, so that both can be processed and vectorized at the same time. The "Type" column used here is target, a numerical representation of which class each project falls into (1 for line, 0 for continuum).

In [2]:
df = pd.read_csv('../../data/raw_data/nrao_projects.csv')
#df['target'] = pd.get_dummies(df['fs_type'], dtype=int)['line']
df['text'] = df.project_title + ". " + df.project_abstract
df['text'] = df['text'].astype(str)
df.head()

Unnamed: 0,project_code,project_title,project_abstract,fs_type,science_category,science_keyword,band,target,raw_text,standardized_text,no_sw_text,lemmatized_sw_text,lemmatized_no_sw_text,text
0,2021.1.01616.L,ALMA JELLY - Survey of Nearby Jellyfish and Ra...,We propose the first ever statistical survey o...,line,Galaxy evolution,"Surveys of galaxies, Galaxy groups and clusters",6,1,ALMA JELLY - Survey of Nearby Jellyfish and Ra...,jelly survey of nearby jellyfish and ram press...,jelly nearby jellyfish ram pressure stripped g...,jelly survey of nearby jellyfish and ram press...,jelly nearby jellyfish ram pressure strip gala...,ALMA JELLY - Survey of Nearby Jellyfish and Ra...
1,2022.1.01077.L,A SPectroscopic survey of biased halos In the ...,We propose to obtain deep ALMA 1.2mm mosaic ob...,line,Galaxy evolution,"Sub-mm Galaxies (SMG), High-z Active Galactic ...",6,1,A SPectroscopic survey of biased halos In the ...,a spectroscopic survey of biased halos in the ...,spectroscopic biased halos reionization era as...,a spectroscopic survey of biased halo in the r...,spectroscopic bias halos reionization era aspi...,A SPectroscopic survey of biased halos In the ...
2,2016.1.00324.L,ASPECS: The ALMA SPECtral line Survey in the U...,ASPECS represents an unparalleled three-dimens...,line,Galaxy evolution,Lyman Break Galaxies (LBG),3,1,ASPECS: The ALMA SPECtral line Survey in the U...,aspecs the spectral line survey in the udf an ...,aspecs spectral line udf program aspecs repres...,aspecs the spectral line survey in the udf an ...,aspecs spectral line udf program aspecs repres...,ASPECS: The ALMA SPECtral line Survey in the U...
3,2022.1.00875.L,The ALMA Disk-Exoplanet C/Onnection,Protoplanetary disks set the initial compositi...,line,Disks and planet formation,"Disks around low-mass stars, Exo-planets",7,1,The ALMA Disk-Exoplanet C/Onnection. Protoplan...,the disk exoplanet c onnection protoplanetary ...,disk exoplanet c onnection protoplanetary disk...,the disk exoplanet c onnection protoplanetary ...,disk exoplanet c onnection protoplanetary disk...,The ALMA Disk-Exoplanet C/Onnection. Protoplan...
4,2017.1.01355.L,ALMA-IMF: ALMA transforms our view of the orig...,Studying massive protoclusters is an absolute ...,line,ISM and star formation,"High-mass star formation, Low-mass star formation",6,1,ALMA-IMF: ALMA transforms our view of the orig...,imf transforms our view of the origin of stell...,imf transforms view origin stellar masses stud...,imf transform our view of the origin of stella...,imf transforms view origin stellar mass study ...,ALMA-IMF: ALMA transforms our view of the orig...
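
As a minimal sketch of how the same two derived columns could be built for a new dataset, the cell below works on a small invented dataframe (the rows are hypothetical, not taken from nrao_projects.csv). It assumes missing titles or abstracts should be treated as empty strings, which avoids the literal string "nan" that `astype(str)` produces for missing values, and it derives `target` directly from `fs_type`, which should be equivalent to the commented-out `get_dummies` line.

```python
import pandas as pd

# Toy stand-in for nrao_projects.csv; these rows are invented for illustration.
toy = pd.DataFrame({
    "project_title": ["CO survey of nearby disks", "Dust continuum imaging", None],
    "project_abstract": ["We map CO(2-1) emission.", None, "Deep 1.3 mm continuum map."],
    "fs_type": ["line", "continuum", "continuum"],
})

# Assumption: a missing title or abstract is treated as an empty string, so the
# concatenation never becomes NaN (which astype(str) would render as "nan").
title = toy["project_title"].fillna("")
abstract = toy["project_abstract"].fillna("")
toy["text"] = (title + ". " + abstract).str.strip()

# Encode the label: 1 for line projects, 0 for continuum projects.
toy["target"] = (toy["fs_type"] == "line").astype(int)

print(toy[["text", "target"]])
```

Any leftover punctuation from an empty field (such as a leading ". ") is removed later by the preprocessing step.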


In [3]:
x = df['target'].value_counts()
print(x)

target
1    3628
0     900
Name: count, dtype: int64


## Text Preprocessing

Below, the preprocessing function does several things to our *text* column within the data. This includes:
1. Lowercasing every character in the string.
2. Removing leading and trailing white spaces.
3. Using regular expressions to match HTML-style tags, punctuation, and digits within the string, remove them or replace them with whitespace, and collapse repeated whitespace.

The processed text is then added to the dataframe as the *clean_text* column.

In [4]:
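# convert to lowercase, strip, and remove punctuation, digits, and extra whitespace
def preprocess(text):
    text = text.lower()  # lowercase every character
    text = text.strip()  # drop leading and trailing whitespace
    text = re.compile(r'<.*?>').sub('', text)  # remove HTML-style tags
    text = re.compile('[%s]' % re.escape(string.punctuation)).sub(' ', text)  # punctuation -> space
    text = re.sub(r'\s+', ' ', text)  # collapse repeated whitespace
    text = re.sub(r'\[[0-9]*\]', ' ', text)  # bracketed numbers like [12] (brackets are already gone here)
    text = re.sub(r'[^\w\s]', '', str(text).lower().strip())  # drop any remaining non-word characters
    text = re.sub(r'\d', ' ', text)  # digits -> space
    text = re.sub(r'\s+', ' ', text)  # collapse whitespace again
    return text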

In [5]:
df['clean_text'] = df['text'].apply(preprocess)
df.head()

Unnamed: 0,project_code,project_title,project_abstract,fs_type,science_category,science_keyword,band,target,raw_text,standardized_text,no_sw_text,lemmatized_sw_text,lemmatized_no_sw_text,text,clean_text
0,2021.1.01616.L,ALMA JELLY - Survey of Nearby Jellyfish and Ra...,We propose the first ever statistical survey o...,line,Galaxy evolution,"Surveys of galaxies, Galaxy groups and clusters",6,1,ALMA JELLY - Survey of Nearby Jellyfish and Ra...,jelly survey of nearby jellyfish and ram press...,jelly nearby jellyfish ram pressure stripped g...,jelly survey of nearby jellyfish and ram press...,jelly nearby jellyfish ram pressure strip gala...,ALMA JELLY - Survey of Nearby Jellyfish and Ra...,alma jelly survey of nearby jellyfish and ram ...
1,2022.1.01077.L,A SPectroscopic survey of biased halos In the ...,We propose to obtain deep ALMA 1.2mm mosaic ob...,line,Galaxy evolution,"Sub-mm Galaxies (SMG), High-z Active Galactic ...",6,1,A SPectroscopic survey of biased halos In the ...,a spectroscopic survey of biased halos in the ...,spectroscopic biased halos reionization era as...,a spectroscopic survey of biased halo in the r...,spectroscopic bias halos reionization era aspi...,A SPectroscopic survey of biased halos In the ...,a spectroscopic survey of biased halos in the ...
2,2016.1.00324.L,ASPECS: The ALMA SPECtral line Survey in the U...,ASPECS represents an unparalleled three-dimens...,line,Galaxy evolution,Lyman Break Galaxies (LBG),3,1,ASPECS: The ALMA SPECtral line Survey in the U...,aspecs the spectral line survey in the udf an ...,aspecs spectral line udf program aspecs repres...,aspecs the spectral line survey in the udf an ...,aspecs spectral line udf program aspecs repres...,ASPECS: The ALMA SPECtral line Survey in the U...,aspecs the alma spectral line survey in the ud...
3,2022.1.00875.L,The ALMA Disk-Exoplanet C/Onnection,Protoplanetary disks set the initial compositi...,line,Disks and planet formation,"Disks around low-mass stars, Exo-planets",7,1,The ALMA Disk-Exoplanet C/Onnection. Protoplan...,the disk exoplanet c onnection protoplanetary ...,disk exoplanet c onnection protoplanetary disk...,the disk exoplanet c onnection protoplanetary ...,disk exoplanet c onnection protoplanetary disk...,The ALMA Disk-Exoplanet C/Onnection. Protoplan...,the alma disk exoplanet c onnection protoplane...
4,2017.1.01355.L,ALMA-IMF: ALMA transforms our view of the orig...,Studying massive protoclusters is an absolute ...,line,ISM and star formation,"High-mass star formation, Low-mass star formation",6,1,ALMA-IMF: ALMA transforms our view of the orig...,imf transforms our view of the origin of stell...,imf transforms view origin stellar masses stud...,imf transform our view of the origin of stella...,imf transforms view origin stellar mass study ...,ALMA-IMF: ALMA transforms our view of the orig...,alma imf alma transforms our view of the origi...


## Extracting Vectors

Below, we use *TfidfVectorizer* from *sklearn.feature_extraction.text* to create one vector per project, each of which represents a combined project title and abstract. This numerical representation of the data can then be used to train and test a machine learning model.

TF-IDF (term frequency-inverse document frequency) is a metric that gives a numerical representation of word usage throughout a corpus of text. It weighs how often a word occurs within a specific document (in our case a project title and abstract) against how many documents across the entire corpus (all projects included in this dataset) contain that word, so words that appear in nearly every project carry less weight.

In [6]:
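# Tf-Idf vectorization
# Note: the vectorizer is fit on the precomputed standardized_text column,
# not on the clean_text column created above.
tfidf_vectorizer = TfidfVectorizer(use_idf=True)
text_vectors_tfidf = tfidf_vectorizer.fit_transform(df.standardized_text)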

## Fitting the Model

In [7]:
# Fit the classification model: logistic regression on the TF-IDF vectors
log_reg_model = LogisticRegression(solver='liblinear', C=10, penalty='l2')
log_reg_model.fit(text_vectors_tfidf, df.target)
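
This notebook trains on the full dataset, so it reports no held-out metrics. As a hedged sketch of how the same model settings could be checked before a production fit, the cell below uses the already-imported `train_test_split` with `stratify` to preserve the class ratio, which matters given the roughly 4:1 line-to-continuum imbalance shown earlier. The corpus is invented purely so the example runs on its own, and the split size and random seed are assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Invented toy corpus with a 4:1 line/continuum imbalance, loosely mimicking the real data.
line_docs = [f"spectral line emission from molecular gas in target {i}" for i in range(40)]
cont_docs = [f"dust continuum flux of protoplanetary disk in target {i}" for i in range(10)]
texts = line_docs + cont_docs
labels = [1] * len(line_docs) + [0] * len(cont_docs)

# stratify keeps the class ratio the same in the train and test splits.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=0
)

# Fit the vectorizer on the training split only, so no test vocabulary leaks in.
vec = TfidfVectorizer(use_idf=True)
clf = LogisticRegression(solver="liblinear", C=10)  # l2 penalty is the default
clf.fit(vec.fit_transform(X_train), y_train)

y_pred = clf.predict(vec.transform(X_test))
print(classification_report(y_test, y_pred, target_names=["continuum", "line"]))
```

With imbalanced classes, the per-class recall in the report is more informative than overall accuracy, since a model that always predicts "line" would already be right about 80% of the time.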

## Example usage
For use in production, import the saved joblib files with something like the following:

`tfidf_vectorizer = load('tfidf_vectorizer_logreg.joblib')`

`log_reg_model = load('log_model.joblib')`

and then run cells like the ones below.

For now, the cells below use the in-memory objects trained in this notebook rather than models loaded from the joblib files; the variable names are the same either way.

In [8]:
test_vec = 'ALMA proposal to observe star formation in the milky way. With the upcoming ALMA proposal cycle, our team wishes to study star formation in the milky way by observing wccc signatures in possible formation regions. We will make spectral line measurements at 445GHz, 450GHz, in alma specific band 6.'
test_vec = preprocess(test_vec)
test_vec

'alma proposal to observe star formation in the milky way with the upcoming alma proposal cycle our team wishes to study star formation in the milky way by observing wccc signatures in possible formation regions we will make spectral line measurements at ghz ghz in alma specific band '

The vectorizer expects an iterable of documents rather than a bare string, so an individual incoming proposal is wrapped in a one-row dataframe.

In [9]:
input_frame = pd.DataFrame({
    'text':test_vec
}, index=[1])

In [10]:
prediction = log_reg_model.predict_proba(tfidf_vectorizer.transform(input_frame.text))
print(f'Predicted probability of only continuum measurements: {round(prediction[0][0]*100, 3)}')
print(f'Predicted probability of at least one line measurement: {round(prediction[0][1]*100, 3)}')

Predicted probability of only continuum measurements: 22.067
Predicted probability of at least one line measurement: 77.933
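
The steps above could also be wrapped in a single helper for scoring incoming proposals. The sketch below is one possible shape for it, not part of the original notebook: `line_probability` and `simple_clean` are hypothetical names, `simple_clean` is a simplified stand-in for `preprocess`, and the tiny training set is invented so the example runs on its own. It passes a one-element list to `transform`, which works because the vectorizer accepts any iterable of strings.

```python
import re
import string

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression


def simple_clean(text: str) -> str:
    """Simplified stand-in for preprocess(): lowercase, drop punctuation and digits."""
    text = text.lower().strip()
    text = re.sub(f"[{re.escape(string.punctuation)}]", " ", text)
    text = re.sub(r"\d", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def line_probability(title: str, abstract: str, vectorizer, model) -> float:
    """Return the predicted probability that a proposal is a line project."""
    cleaned = simple_clean(f"{title}. {abstract}")
    # transform() accepts any iterable of strings, so a one-element list is enough.
    features = vectorizer.transform([cleaned])
    # Look up the column for class 1 explicitly instead of assuming column order.
    line_col = list(model.classes_).index(1)
    return float(model.predict_proba(features)[0, line_col])


# Tiny invented training set so the sketch runs end to end.
train_texts = ["spectral line emission of co", "dust continuum flux density",
               "molecular line survey", "continuum imaging of dust"]
train_labels = [1, 0, 1, 0]
vec = TfidfVectorizer()
clf = LogisticRegression(solver="liblinear", C=10)
clf.fit(vec.fit_transform(train_texts), train_labels)

p = line_probability("CO line mapping", "We observe spectral line emission.", vec, clf)
print(f"P(line) = {p:.3f}")
```

Reading the class column from `model.classes_` guards against silently swapping the two probabilities if the label encoding ever changes.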


## Save Models

Once the TF-IDF vectorizer and logistic regression model are trained, we save them as joblib files so that future proposals are evaluated with the same fitted objects, without re-training every time this notebook is run. An example of re-loading the vectorizer and logistic regression model can be found in model_import_tes.ipynb.

In [11]:
dump(tfidf_vectorizer, 'tfidf_vectorizer_logreg.joblib')
dump(log_reg_model, 'log_model.joblib')

['log_model.joblib']
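
As a quick sanity check on the save step, the following sketch round-trips a vectorizer and model through joblib and confirms that the reloaded pair gives the same probabilities as the originals. It uses an invented toy training set and writes to a temporary directory (an assumption made so the example does not touch the real model files); only the file names match the ones used above.

```python
import tempfile
from pathlib import Path

import numpy as np
from joblib import dump, load
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny invented training set; the real notebook uses the full project dataset.
texts = ["spectral line emission of co", "dust continuum flux density",
         "molecular line survey", "continuum imaging of dust"]
labels = [1, 0, 1, 0]
vec = TfidfVectorizer()
clf = LogisticRegression(solver="liblinear", C=10)
clf.fit(vec.fit_transform(texts), labels)

with tempfile.TemporaryDirectory() as tmp:
    vec_path = Path(tmp) / "tfidf_vectorizer_logreg.joblib"
    model_path = Path(tmp) / "log_model.joblib"
    dump(vec, vec_path)
    dump(clf, model_path)

    # Reload both objects while the temporary files still exist.
    loaded_vec = load(vec_path)
    loaded_clf = load(model_path)

sample = ["spectral line observations of dust"]
before = clf.predict_proba(vec.transform(sample))
after = loaded_clf.predict_proba(loaded_vec.transform(sample))
same = bool(np.allclose(before, after))
print(f"Reloaded model reproduces predictions: {same}")
```

The vectorizer and the model must always be saved and loaded as a pair, because the model's coefficients are tied to the vocabulary the vectorizer learned.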