# **AI Society Week 3. Feature Generation**

## Jupyter Notebooks and Python
For workshops, we typically use **Google Colab**, a service that lets you run Python code in a hosted, containerized environment. It even provides free GPU/TPU runtimes!
___

In Google Colab, you write code using **Jupyter notebooks**. Notebooks are composed of text and code cells, which you run by pressing `Shift + Enter`. For your convenience, here are a few more handy keyboard shortcuts:

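* `b`: New cell below (Jupyter command mode; in Colab, press `Ctrl + M`, then `B`)
* `a`: New cell above (Jupyter command mode; in Colab, press `Ctrl + M`, then `A`)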


## Hardware Needed:
Any computer with internet access and a web browser.

* 👉🏻Link to Dataset:
https://drive.google.com/file/d/1VDFkgayrdo--zKvZk5E-GLFRVsAGQ-qz/view?usp=sharing

In [None]:
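# Uncomment the two lines below to mount Google Drive (the dataset path used later assumes it is mounted)
# from google.colab import drive
# drive.mount('/content/drive')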

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
# Import the necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

In [None]:
# Load the data into a pandas DataFrame
data = pd.read_csv("/content/drive/MyDrive/AI Society Fall23/Week 3/Week3_data.csv")
data.head()

|  | event_name | datetime | description | host_org | event_perks | categories | location | event_weekend |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 'These Are Your First Steps': A Brief Introduc... | 2024-09-18 10:00:00 | Are you interested in learning how to code? Ar... | Library |  |  | Online | 0 |
| 1 | 11th Annual Healthcare Panel Banquet | 2024-11-14 18:00:00 | The Black Medical Students Association present... | Black Medical Student Association | Food Stuff | Club Meetings, In-Person Event | Memorial Union MU-202 | 0 |
| 2 | 11th Annual Poly Game Night | 2024-08-23 18:00:00 | Join ASU Library for the 11th Annual Poly Game... | Host Organizations Library Polytechnic | Food Stuff | General, PAB Event, Welcome Event | Polytechnic campus Library5988 S. Backus Mall,... | 0 |
| 3 | 2024 Annual SunMUN High School Conference | 2024-11-15 08:00:00 | Delegates from across Arizona and the Southwes... | Model United Nations at Arizona State University |  | Club Meetings, International, In-Person Event | State University Tempe Campus1151 S Forest Av... | 0 |
| 4 | 7V7 and Powderpuff Football Tournament | 2024-08-29 17:00:00 | A football tournament where men and women will... | HOPE CHURCH MOVEMENT AT |  |  | SDFC West Field | 0 |


In [None]:
data.tail()

|  | event_name | datetime | description | host_org | event_perks | categories | location | event_weekend |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 371 | White Elephant | 2024-11-11 18:00:00 | To celebrate the holidays and connect students... | Business School Council |  |  | Deans Patio | 0 |
| 372 | Women IN Panel | 2024-11-18 18:00:00 | A panel hosting women with diverse professiona... | Business School Council |  |  | Business Administration (BA) 199 | 0 |
| 373 | Women In Sports Panel | 2024-10-03 18:00:00 | Students will gain a deeper understanding of t... | Sports Business Association at |  |  | BAC 215400 E Lemon St, Tempe, 85281 | 0 |
| 374 | Yoga Workshop | 2024-08-26 18:30:00 | Incorporating a Yoga Workshop into a dance c... | Lasya at | Food Stuff |  | Pitchforks StudiosSDFC | 0 |
| 375 | Young Life College Weekend | 2024-10-18 17:00:00 | A weekend in Williams, Arizona focusing on fri... | Young Life at |  |  | Lost Canyon | 0 |


In [None]:
data.shape

(376, 8)

In [None]:
data.columns

Index(['event_name', 'datetime', 'description', 'host_org', 'event_perks',
       'categories', 'location', 'event_weekend'],
      dtype='object')

## **Feature Transformation**

In [None]:
# Importing other libraries for feature transformation
import re
import string
!pip install contractions
import contractions
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')



[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [None]:
# Build the stop-word set once; set lookups are faster than scanning a list on every call
STOP_WORDS = set(stopwords.words('english'))

def preprocess_text(text):
    # Coerce to str so non-string values do not raise, and expand contractions
    # first, while the apostrophes are still present; then lowercase
    text = contractions.fix(str(text)).lower()

    # Remove HTML tags
    text = re.sub(r'<.*?>', '', text)

    # Remove URLs
    text = re.sub(r'https?://[^\s]+', '', text)

    # Remove mentions
    text = re.sub(r'@\w+', '', text)

    # Remove hashtags
    text = re.sub(r'#\w+', '', text)

    # Remove punctuation, digits, and other special characters (keep letters and whitespace)
    text = re.sub(r'[^a-z\s]', '', text)

    # Collapse extra whitespace
    text = re.sub(r'\s+', ' ', text.strip())

    # Remove stop words using NLTK's English list
    return ' '.join(word for word in text.split() if word not in STOP_WORDS)
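
To see what the regex steps do in isolation, here is a minimal stdlib-only sketch run on a single made-up string. The names `clean_text_sketch` and `TINY_STOP_WORDS` are hypothetical stand-ins for illustration: the workshop function uses NLTK's full English stop-word list and the `contractions` package, which this sketch leaves out.

```python
import re

# Hypothetical, tiny stop-word set used only for this illustration;
# the workshop code uses NLTK's full English list instead.
TINY_STOP_WORDS = {"the", "a", "an", "at", "for", "and", "is", "us"}

def clean_text_sketch(text):
    """Apply the regex-based cleaning steps to a single string."""
    text = str(text).lower()
    text = re.sub(r'<.*?>', '', text)             # HTML tags
    text = re.sub(r'https?://[^\s]+', '', text)   # URLs
    text = re.sub(r'@\w+', '', text)              # mentions
    text = re.sub(r'#\w+', '', text)              # hashtags
    text = re.sub(r'[^a-z\s]', '', text)          # punctuation, digits, symbols
    text = re.sub(r'\s+', ' ', text).strip()      # extra whitespace
    return ' '.join(w for w in text.split() if w not in TINY_STOP_WORDS)

# Made-up sample text, not taken from the dataset
sample = "<b>Join us</b> for the 11th Annual Game Night! RSVP at https://example.com #fun @asu"
cleaned = clean_text_sketch(sample)
print(cleaned)  # join th annual game night rsvp
```

Notice that `11th` becomes `th` once digits are stripped, which is the same effect visible in the cleaned `description` column.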

In [None]:
data.description = data.description.apply(preprocess_text)
data.head()

|  | event_name | datetime | description | host_org | event_perks | categories | location | event_weekend |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 'These Are Your First Steps': A Brief Introduc... | 2024-09-18 10:00:00 | interested learning code trying find free use ... | Library |  |  | Online | 0 |
| 1 | 11th Annual Healthcare Panel Banquet | 2024-11-14 18:00:00 | black medical students association presents th... | Black Medical Student Association | Food Stuff | Club Meetings, In-Person Event | Memorial Union MU-202 | 0 |
| 2 | 11th Annual Poly Game Night | 2024-08-23 18:00:00 | join asu library th annual poly game night fri... | Host Organizations Library Polytechnic | Food Stuff | General, PAB Event, Welcome Event | Polytechnic campus Library5988 S. Backus Mall,... | 0 |
| 3 | 2024 Annual SunMUN High School Conference | 2024-11-15 08:00:00 | delegates across arizona southwest gather simu... | Model United Nations at Arizona State University |  | Club Meetings, International, In-Person Event | State University Tempe Campus1151 S Forest Av... | 0 |
| 4 | 7V7 and Powderpuff Football Tournament | 2024-08-29 17:00:00 | football tournament men women form teams compe... | HOPE CHURCH MOVEMENT AT |  |  | SDFC West Field | 0 |


In [None]:
data.columns

Index(['event_name', 'datetime', 'description', 'host_org', 'event_perks',
       'categories', 'location', 'event_weekend'],
      dtype='object')

In [None]:
data.event_perks = data['event_perks'].astype(str)
data.categories = data['categories'].astype(str)
data.location = data['location'].astype(str)
data.host_org = data['host_org'].astype(str)

In [None]:
data.host_org = data.host_org.apply(preprocess_text)

In [None]:
data.event_perks = data.event_perks.apply(preprocess_text)

In [None]:
data.categories = data.categories.apply(preprocess_text)

In [None]:
data.location = data.location.apply(preprocess_text)

In [None]:
data.head()

|  | event_name | datetime | description | host_org | event_perks | categories | location | event_weekend |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 'These Are Your First Steps': A Brief Introduc... | 2024-09-18 10:00:00 | interested learning code trying find free use ... | library |  |  | online | 0 |
| 1 | 11th Annual Healthcare Panel Banquet | 2024-11-14 18:00:00 | black medical students association presents th... | black medical student association | food stuff | club meetings inperson event | memorial union mu | 0 |
| 2 | 11th Annual Poly Game Night | 2024-08-23 18:00:00 | join asu library th annual poly game night fri... | host organizations library polytechnic | food stuff | general pab event welcome event | polytechnic campus library backus mall mesa | 0 |
| 3 | 2024 Annual SunMUN High School Conference | 2024-11-15 08:00:00 | delegates across arizona southwest gather simu... | model united nations arizona state university |  | club meetings international inperson event | state university tempe campus forest avenue tempe | 0 |
| 4 | 7V7 and Powderpuff Football Tournament | 2024-08-29 17:00:00 | football tournament men women form teams compe... | hope church movement |  |  | sdfc west field | 0 |


In [None]:
data.isna().sum()

|  | 0 |
| --- | --- |
| event_name | 0 |
| datetime | 0 |
| description | 0 |
| host_org | 0 |
| event_perks | 0 |
| categories | 0 |
| location | 0 |
| event_weekend | 0 |


## **Let's generate features**

# **TF-IDF Vectorization**

*In this section, we use the `TfidfVectorizer` from the `sklearn` library to convert our preprocessed text data into TF-IDF features. TF-IDF weights each word by how important it is to a document relative to the entire corpus.*

## **What is TF-IDF?**

*TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical measure used to evaluate the importance of a word in a document relative to a collection of documents (corpus). The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus.*

### **Formula**

- **TF (Term Frequency)**:
  TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document)

- **IDF (Inverse Document Frequency)**:
  IDF(t) = log(Total number of documents / Number of documents with term t in it)

- **TF-IDF**:
  TF-IDF(t) = TF(t) * IDF(t)
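
As a worked example, the three formulas above can be computed by hand in a few lines of plain Python. This is a minimal sketch on an invented toy corpus; the helper names `tf`, `idf`, and `tf_idf` are illustrative, and the natural logarithm is assumed for `log`.

```python
import math

# Toy corpus (invented for illustration): each document is a list of tokens
docs = [
    ["game", "night", "library"],
    ["game", "tournament"],
    ["yoga", "workshop", "library", "library"],
]

def tf(term, doc):
    """Number of times the term appears, divided by the document length."""
    return doc.count(term) / len(doc)

def idf(term, corpus):
    """log(N / number of documents containing the term).

    Assumes the term occurs in at least one document.
    """
    n_containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / n_containing)

def tf_idf(term, doc, corpus):
    """Product of the two quantities above."""
    return tf(term, doc) * idf(term, corpus)

score_library = tf_idf("library", docs[2], docs)
score_yoga = tf_idf("yoga", docs[2], docs)
print(round(score_library, 4), round(score_yoga, 4))  # 0.2027 0.2747
```

Although "library" appears twice in the third document and "yoga" only once, "yoga" scores higher because it occurs in no other document.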

## **Using TfidfVectorizer**

*We use the `TfidfVectorizer` to transform each preprocessed text column into a sparse TF-IDF matrix, with one row per event and one column per vocabulary term.*

In [None]:
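def feature_tf_idf(data):
    from sklearn.feature_extraction.text import TfidfVectorizer
    # A fresh vectorizer is fitted on each call, so every column gets its own vocabulary.
    # max_features keeps at most the 10,000 most frequent terms across the corpus.
    vectorizer = TfidfVectorizer(max_features=10000)
    # fit_transform learns the vocabulary and returns a sparse (documents x terms) matrix
    X = vectorizer.fit_transform(data)
    return X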
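
Note that `TfidfVectorizer`'s default settings differ slightly from the textbook formula: it uses raw term counts for TF, a smoothed IDF of ln((1 + n) / (1 + df)) + 1 (`smooth_idf=True`), and scales each row to unit length (`norm='l2'`). The stdlib-only sketch below mirrors those documented defaults on an invented toy corpus. The helpers `smoothed_idf` and `tfidf_row` are hypothetical names for illustration, not scikit-learn's own code.

```python
import math

# Toy corpus (invented for illustration), already tokenized
docs = [["game", "night"], ["game", "tournament"], ["yoga", "night", "night"]]
vocab = sorted({t for d in docs for t in d})   # one column per term, alphabetical
n_docs = len(docs)

def smoothed_idf(term):
    """IDF with add-one smoothing: ln((1 + n) / (1 + df)) + 1."""
    df = sum(1 for d in docs if term in d)
    return math.log((1 + n_docs) / (1 + df)) + 1

def tfidf_row(doc):
    """Raw term counts times smoothed IDF, then L2-normalized."""
    raw = [doc.count(t) * smoothed_idf(t) for t in vocab]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw]

matrix = [tfidf_row(d) for d in docs]
print(vocab)
for row in matrix:
    print([round(x, 3) for x in row])
```

Because of the normalization step, every row has length 1, so longer descriptions do not automatically receive larger feature values.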

In [None]:
data.columns

Index(['event_name', 'datetime', 'description', 'host_org', 'event_perks',
       'categories', 'location', 'event_weekend'],
      dtype='object')

In [None]:
tfidf_description = feature_tf_idf(data.description)
tfidf_perks = feature_tf_idf(data.event_perks)
tfidf_categories = feature_tf_idf(data.categories)
tfidf_location = feature_tf_idf(data.location)
tfidf_host_org = feature_tf_idf(data.host_org)

In [None]:
# Combine TF-IDF matrices
from scipy.sparse import hstack
feature_tfidf = hstack((tfidf_description, tfidf_perks, tfidf_categories, tfidf_location, tfidf_host_org))

In [None]:
feature_tfidf.shape

(376, 3604)
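
The shape `(376, 3604)` follows from how horizontal stacking works: the row count stays at 376 events, while the column count is the sum of the five per-column vocabulary sizes. The following is a conceptual sketch in plain Python, assuming a simple dictionary-of-keys layout. `hstack_sketch` is a hypothetical helper and not how `scipy.sparse.hstack` is implemented.

```python
def hstack_sketch(blocks):
    """Horizontally stack sparse blocks given as (shape, {(row, col): value})."""
    n_rows = blocks[0][0][0]
    stacked, col_offset = {}, 0
    for (rows, cols), entries in blocks:
        if rows != n_rows:
            raise ValueError("all blocks must have the same number of rows")
        for (r, c), v in entries.items():
            # Shift each column index past the columns of the earlier blocks
            stacked[(r, c + col_offset)] = v
        col_offset += cols
    return (n_rows, col_offset), stacked

# Two toy blocks with 2 rows each (values invented for illustration)
block_a = ((2, 3), {(0, 0): 0.7, (1, 2): 0.4})
block_b = ((2, 2), {(0, 1): 1.0})
shape, entries = hstack_sketch([block_a, block_b])
print(shape)    # (2, 5)
print(entries)  # {(0, 0): 0.7, (1, 2): 0.4, (0, 4): 1.0}
```

Row `i` of the combined matrix therefore still describes event `i`, with the features from each text column laid side by side.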

In [None]:
feature_tfidf

Thank you for attending the workshop. We will continue next week.