# TF-IDF Feature Extraction

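- Core Idea: score how important a word is to one document (an instance in the dataset) relative to the whole corpus; words that are frequent in a document but rare elsewhere score highest [1-Xuezhe Ma NLP USC Course]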
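
To make the core idea concrete, here is a minimal hand-computed sketch in plain Python. The toy corpus and the names `toy_docs`, `doc_freq`, and `tfidf_vector` are made up for illustration. The smoothed IDF formula and the L2 normalization are assumptions, chosen to mirror what scikit-learn's `TfidfVectorizer` does by default as far as I understand it.

```python
import math
from collections import Counter

# Toy corpus (hypothetical); each string is one document.
toy_docs = [
    "revenue will rise",
    "revenue will fall",
    "debt will fall",
]

tokenized = [doc.split() for doc in toy_docs]
n_docs = len(tokenized)

# Document frequency: the number of documents that contain each term.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

# Smoothed IDF (assumed formula): ln((1 + n) / (1 + df)) + 1
idf = {
    term: math.log((1 + n_docs) / (1 + count)) + 1
    for term, count in doc_freq.items()
}


def tfidf_vector(tokens):
    """Return {term: weight} for one document, scaled to unit L2 length."""
    term_counts = Counter(tokens)
    raw = {term: count * idf[term] for term, count in term_counts.items()}
    norm = math.sqrt(sum(weight ** 2 for weight in raw.values()))
    return {term: weight / norm for term, weight in raw.items()}


weights = [tfidf_vector(tokens) for tokens in tokenized]
for doc, vector in zip(toy_docs, weights):
    print(doc, {term: round(weight, 3) for term, weight in vector.items()})
```

In the first document, "rise" (found in one document) outweighs "revenue" (found in two), which in turn outweighs "will" (found in all three).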

In [1]:
import os

import pandas as pd  # used below for pd.concat and pd.DataFrame
from sklearn.feature_extraction.text import TfidfVectorizer

## Decide data: Cleaned or Original

In [7]:
clean_data = False  # Standalone notebook: set the flag by hand
# clean_data = os.getenv('MY_VARIABLE')  # Pipeline run: read the flag from the environment (os.getenv returns a string, so any non-empty value, even 'False', is truthy)

if clean_data:
    cleaned_predictions_notebook = 'clean_generated_prediction.ipynb'
    %run $cleaned_predictions_notebook
    df = cleaner.df
else:
    generate_predictions_notebook = 'generate_prediction.ipynb'
    %run $generate_predictions_notebook
    df = pd.concat([predictions_df, non_predictions_df], ignore_index=True)
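
If the flag is read from the environment, note that `os.getenv` returns a string, and any non-empty string is truthy in an `if` test. The sketch below shows one possible way to parse such a flag. The helper name `env_flag` and the set of accepted spellings are assumptions for illustration, not part of the pipeline.

```python
import os


def env_flag(name, default=False):
    """Read a boolean flag from an environment variable (hypothetical helper)."""
    value = os.getenv(name)
    if value is None:
        return default
    # Assumed convention: only these spellings count as "on".
    return value.strip().lower() in {"1", "true", "yes"}


# Demonstration: the raw string 'False' is truthy, the parsed flag is not.
os.environ["MY_VARIABLE"] = "False"
print(bool(os.getenv("MY_VARIABLE")))  # True
print(env_flag("MY_VARIABLE"))         # False
```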

In [8]:
df

Unnamed: 0,Base Predictions,Prediction Label
0,"On Wednesday, November 20, 2024, Emily Chen forecasts that the revenue at Amazon (AMZN) will rise by 12% to $150 billion in Q2 of 2026.",1
1,"On Thursday, October 17, 2024, Liam Kim predicts that the operating cash flow at Microsoft (MSFT) should decrease by 2% to $50 billion in Q1 of 2027.",1
2,"On Friday, September 13, 2024, Ava Morales envisions that the stock price at Tesla (TSLA) will likely rise by 30% to $500 per share in Q4 of 2028.",1
3,"On Monday, August 19, 2024, Ethan Hall speculates that the dividend payout ratio at Procter & Gamble (PG) will probably remain at 60% in Q3 of 2026.",1
4,"On Tuesday, July 16, 2024, Mia Kim predicts that the research and development expenses at Johnson & Johnson (JNJ) may increase by 10% to $12 billion in FY 2029.",1
5,"On Wednesday, June 12, 2024, Julian Sanchez forecasts that the return on equity (ROE) at Visa (V) has a high probability of improving by 3% to 20% in Q2 of 2027.",1
6,"On Thursday, May 9, 2024, Logan Brooks predicts that the capital expenditures at Chevron (CVX) should decrease by 5% to $10 billion in Q1 of 2028.",1
7,"On Friday, April 12, 2024, Hannah Taylor envisions that the gross profit margin at Cisco Systems (CSCO) will likely expand by 2% to 25% in Q4 of 2026.",1
8,"On Monday, March 18, 2024, Detravious Lee forsees that the total debt at AT&T (T) will probably decrease by 8% to $150 billion in Q3 of 2029.",1
9,"On Tuesday, February 13, 2024, Raj Patel predicts that the earnings before interest and taxes (EBIT) at 3M (MMM) may increase by 6% to $10 billion in FY 2028.",1


- Document = each row (one prediction sentence)
- Corpus = the entire `Base Predictions` column (all documents)
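
As a rough illustration of how each document is split into terms, the sketch below runs the default analyzer of `TfidfVectorizer` on two fragments of the sentences above. The name `toy_corpus` is made up for this example, and the fragments are shortened by hand. With default settings the text is lowercased and only tokens of two or more word characters are kept, so `(V)`, `%`, and `$` do not survive.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Each element of the Series is one document; the Series is the corpus.
toy_corpus = pd.Series([
    "the return on equity (ROE) at Visa (V)",
    "will rise by 12% to $150 billion",
])

# build_analyzer() returns the callable the vectorizer uses to turn
# one document into a list of terms (default settings assumed here).
analyzer = TfidfVectorizer().build_analyzer()

tokens_per_document = [analyzer(document) for document in toy_corpus]
for tokens in tokens_per_document:
    print(tokens)
```

This is why purely numeric features such as `12` and `150` appear among the learned terms, while the single-letter ticker `V` does not.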

In [9]:
corpus_to_vectorize = df['Base Predictions']
corpus_to_vectorize

0                               On Wednesday, November 20, 2024, Emily Chen forecasts that the revenue at Amazon (AMZN) will rise by 12% to $150 billion in Q2 of 2026.
1                 On Thursday, October 17, 2024, Liam Kim predicts that the operating cash flow at Microsoft (MSFT) should decrease by 2% to $50 billion in Q1 of 2027.
2                    On Friday, September 13, 2024, Ava Morales envisions that the stock price at Tesla (TSLA) will likely rise by 30% to $500 per share in Q4 of 2028.
3                  On Monday, August 19, 2024, Ethan Hall speculates that the dividend payout ratio at Procter & Gamble (PG) will probably remain at 60% in Q3 of 2026.
4      On Tuesday, July 16, 2024, Mia Kim predicts that the research and development expenses at Johnson & Johnson (JNJ) may increase by 10% to $12 billion in FY 2029.
5     On Wednesday, June 12, 2024, Julian Sanchez forecasts that the return on equity (ROE) at Visa (V) has a high probability of improving by 3% to 20% in Q2 o

- `fit_transform()`: converts the text into numerical form; equivalent to calling `fit()` and then `transform()` on the same data.
    - `fit()`: learns the vocabulary and the IDF weights from the corpus
    - `transform()`: turns the text into a matrix of TF-IDF features using the vocabulary and IDF learned by `fit()`
    - each row (document) becomes one vector
    - output (a sparse matrix):
        - stored elements: the number of non-zero TF-IDF scores; zeros are not stored
        - rows: the number of documents (input rows)
        - columns: the number of unique terms (features) in the entire corpus
            - NOT the same as the number of input columns
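
The points above can be checked on a small example. This is a sketch on a made-up three-sentence corpus (`toy_docs` and the other variable names are illustrative). It compares the one-step and two-step calls, then reads the shape and the stored-element count off the sparse result.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (hypothetical): 3 documents, 5 unique terms.
toy_docs = [
    "revenue will rise",
    "revenue will fall",
    "debt will fall",
]

# One step: learn vocabulary and IDF, then vectorize.
one_step = TfidfVectorizer().fit_transform(toy_docs)

# Two steps: fit() learns, transform() applies what was learned.
two_step_vectorizer = TfidfVectorizer()
two_step_vectorizer.fit(toy_docs)
two_step = two_step_vectorizer.transform(toy_docs)

same_result = np.allclose(one_step.toarray(), two_step.toarray())
n_rows, n_cols = one_step.shape
density = one_step.nnz / (n_rows * n_cols)

print(same_result)        # whether both routes give the same matrix
print(one_step.shape)     # (documents, unique terms)
print(one_step.nnz)       # number of stored (non-zero) scores
print(round(density, 3))  # fraction of cells that are non-zero

# transform() on new text reuses the learned vocabulary;
# a term never seen during fit() ("profit") gets no column.
new_doc = two_step_vectorizer.transform(["profit will rise"])
print(new_doc.shape, new_doc.nnz)
```

The same arithmetic applies to any sparse result: the density is the stored-element count divided by rows times columns.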

In [10]:
vectorizer = TfidfVectorizer()
tf_idf_features = vectorizer.fit_transform(corpus_to_vectorize)

tf_idf_features

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 309 stored elements and shape (19, 165)>

In [11]:
# Convert the sparse TF-IDF matrix to a dense array for easy viewing
dense_matrix = tf_idf_features.toarray()

# Get the feature names (terms) learned by the vectorizer
feature_names = vectorizer.get_feature_names_out()

# Create a DataFrame to visualize the TF-IDF scores
tfidf_df = pd.DataFrame(dense_matrix, columns=feature_names)

# Print the shape of the complete TF-IDF DataFrame
print(tfidf_df.shape)

# Display the first few rows of the DataFrame
print(tfidf_df.head())

(19, 165)
         10        12        13       150        16        17   18       19  \
0  0.000000  0.188992  0.000000  0.229448  0.000000  0.000000  0.0  0.00000   
1  0.000000  0.000000  0.000000  0.000000  0.000000  0.246917  0.0  0.00000   
2  0.000000  0.000000  0.202806  0.000000  0.000000  0.000000  0.0  0.00000   
3  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.0  0.23201   
4  0.176582  0.161482  0.000000  0.000000  0.223488  0.000000  0.0  0.00000   

         20      2024  ...  thursday        to  total      tsla  tuesday  \
0  0.229448  0.126547  ...  0.000000  0.134095    0.0  0.000000  0.00000   
1  0.000000  0.119462  ...  0.216603  0.126588    0.0  0.000000  0.00000   
2  0.000000  0.111853  ...  0.000000  0.118525    0.0  0.231189  0.00000   
3  0.000000  0.112249  ...  0.000000  0.000000    0.0  0.000000  0.00000   
4  0.000000  0.108126  ...  0.000000  0.114576    0.0  0.000000  0.19605   

   visa  water  wednesday      will  writing  
0   0.0    