## Imports

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics.pairwise import cosine_similarity

## Data Format

In [2]:
mailchimp_data = pd.read_csv('./data/MailChimp cleaned records headers.csv')
pd.set_option('max_columns', None)
mailchimp_data

Unnamed: 0,Email Address,First Name,Last Name,Board Member,Gender,Chapter,Reunion Year,Country,Degree,MEMBER_RATING,OPTIN_TIME,OPTIN_IP,CONFIRM_TIME,CONFIRM_IP,LATITUDE,LONGITUDE,GMTOFF,DSTOFF,TIMEZONE,CC,REGION,CLEAN_TIME,CLEAN_CAMPAIGN_TITLE,CLEAN_CAMPAIGN_ID,LEID,EUID,NOTES,TAGS


The mailchimp dataset comes from the salesforce dataset. This Cleaned dataset would only include emails that have bounced for one reason or another. It will not include any members who have opted out of the email service or those that are receiving emails without any complications. It is important to note that updating information within salesforce for those who have opted out may re-instantiate the email services so we are specifically working with cleaned/bounced records. Additionally, the datasets created for those who are Subscribed and Unsubscribed will have different column names than the Cleaned dataset here (ie Clean_Time, Clean_campaign_title, ect). Therefore any function created here may not work for those datasets. 

In [3]:
saa_pride_data = pd.read_excel('./data/SAA Pride member reports headings.xlsx')
saa_pride_data.reset_index(inplace = True)
saa_pride_data

Unnamed: 0,index,pref_mail_name,pref_class_year,home_city,home_state_code,home_country,home_phone_area_code,home_phone_number,home_email_address,bus_city,bus_state_code,bus_country,bus_phone_area_code,bus_phone_number,bus_email_address,first_name,last_name,pref_name_sort,email_switch,saa_email_address,gsb_email_address,other_email_address,pref_phone_area_code,pref_phone_number,pref_phone_addr_type,memb_status_desc,short_degree_string,parent_degree_string,short_degree_string_spouse,parent_degree_string_spouse,primary_sort_name,plan_name,primary_ind


The Stanford Alumni Association has it's own dataset that may or may not have additional or more recent data on some of the members. It may also have outdated data. Students are given an email but when they become alumni the email needs to be updated. Whether it is updated to specifically an 'alumni.stanford.edu' address or to another would be at the students discrepancy and isn't always updated.

### Pokemon Data

In [4]:
saa_poke = pd.read_excel('./data/SAA_Pokemon_FakeDB.xlsx')
saa_poke
# Filter necessary columns
saa_poke2 = saa_poke.filter(['home_country', 'home_email_address',\
                 'bus_email_address', 'first_name', 'last_name', 'email_switch',\
                'saa_email_address', 'gsb_email_address', 'other_email_address',])
saa_poke2

Unnamed: 0,home_country,home_email_address,bus_email_address,first_name,last_name,email_switch,saa_email_address,gsb_email_address,other_email_address
0,*,,,Growlithe,Ice,,*,,
1,China,weedleg4046@stanfordalumni.org,,Weedle,Grass,,w.grass5053@alumni.stanford.edu,,
2,Kuwait,aerodactyl.electric2974@alumni.stanford.edu,,Aerodactyl,Electric,,*,,
3,,*,,Pinsir,Fire,pinsirfire4582@gmail.com,*,,
4,USA,*,,Horsea,Ice,hice7313@stanfordalumni.org,,,
...,...,...,...,...,...,...,...,...,...
3995,Japan,*,,Smeargle,Electric,smeargleelectric9444@gmail.com,*,,
3996,,kabutops.steel1285@stanfordalumni.org,,Kabutops,Steel,k.steel5317@alumni.stanford.edu,,,
3997,Kuwait,*,,Slowking,Dragon,,slowkingd5563@stanfordalumni.org,,
3998,Japan,larvitar.electric9778@stanfordalumni.org,,Larvitar,Electric,l.electric7920@stanfordalumni.org,,,


In [5]:
mailchimp_poke = pd.read_csv('./data/Fake_MailChimp_cleaned_Pokemon.csv')
mailchimp_poke

Unnamed: 0,Email Address,First Name,Last Name,Board Member,Gender,Chapter,Reunion Year,Country,Degree,MEMBER_RATING,OPTIN_TIME,OPTIN_IP,CONFIRM_TIME,CONFIRM_IP,LATITUDE,LONGITUDE,GMTOFF,DSTOFF,TIMEZONE,CC,REGION,CLEAN_TIME,CLEAN_CAMPAIGN_TITLE,CLEAN_CAMPAIGN_ID,LEID,EUID,NOTES,TAGS
0,slakoth.normal3945@gmail.com,Slakoth,Normal,False,,Texas,,USA,,,,,,,,,,,,,,,,,,,,
1,e.rock7454@gmail.com,Espeon,Rock,True,F,DC Area,,United States,,,,,,,,,,,,,,,,,,,,
2,rhydonghost7966@alumni.stanford.edu,Rhydon,Ghost,False,M,Bay Area,,USA,MBA,,,,,,,,,,,,,,,,,,,
3,porygong9247@stanfordalumni.org,Porygon,Grass,False,M,Bay Area,,Japan,MS,,,,,,,,,,,,,,,,,,,
4,tangelagrass1376@gmail.com,Tangela,Grass,False,,New England,,United States,,,,,,,,,,,,,,,,,,,,
5,c.electric7518@gmail.com,Chansey,Steel,True,F,Other US,,USA,,,,,,,,,,,,,,,,,,,,
6,blissey.ghost4154@gmail.com,Blissey,Ghost,False,M,New England,,Macao Special Administrative Region of China,,,,,,,,,,,,,,,,,,,,


In [6]:
# Filter necessary columns
mailchimp_poke2 = mailchimp_poke.filter(['Email Address', 'First Name', 'Last Name',\
                    'Country'])
mailchimp_poke2

Unnamed: 0,Email Address,First Name,Last Name,Country
0,slakoth.normal3945@gmail.com,Slakoth,Normal,USA
1,e.rock7454@gmail.com,Espeon,Rock,United States
2,rhydonghost7966@alumni.stanford.edu,Rhydon,Ghost,USA
3,porygong9247@stanfordalumni.org,Porygon,Grass,Japan
4,tangelagrass1376@gmail.com,Tangela,Grass,United States
5,c.electric7518@gmail.com,Chansey,Steel,USA
6,blissey.ghost4154@gmail.com,Blissey,Ghost,Macao Special Administrative Region of China


In [7]:
mailchimp_poke2['handle'] = mailchimp_poke2['Email Address'].str.split('@').str[0]
mailchimp_poke2

Unnamed: 0,Email Address,First Name,Last Name,Country,handle
0,slakoth.normal3945@gmail.com,Slakoth,Normal,USA,slakoth.normal3945
1,e.rock7454@gmail.com,Espeon,Rock,United States,e.rock7454
2,rhydonghost7966@alumni.stanford.edu,Rhydon,Ghost,USA,rhydonghost7966
3,porygong9247@stanfordalumni.org,Porygon,Grass,Japan,porygong9247
4,tangelagrass1376@gmail.com,Tangela,Grass,United States,tangelagrass1376
5,c.electric7518@gmail.com,Chansey,Steel,USA,c.electric7518
6,blissey.ghost4154@gmail.com,Blissey,Ghost,Macao Special Administrative Region of China,blissey.ghost4154


In [8]:
def ohe(df, column):    
    for col in column:
        train = df[[col]]
        ohe = OneHotEncoder(sparse=False, handle_unknown="error")
        ohe.fit(train)
        encoded_train = ohe.transform(train)
        col_names = [f"{col}_{f}" for f in ohe.get_feature_names()]
        encoded_train = pd.DataFrame(encoded_train,
                                     columns=col_names, index=df.index)
        df = pd.concat([df, encoded_train], axis=1)
        
    return df

In [9]:
saa_poke2.fillna(value='Not Available', inplace=True)
emails = ['home_email_address', 'bus_email_address', 'email_switch', 'saa_email_address',\
         'gsb_email_address', 'other_email_address']
handles = []
for x in emails:
    for i in range(0,saa_poke2.shape[0]):
        if '@' in saa_poke2[x][i]:
            saa_poke2[x+'_handle'] = saa_poke2[x].str.split('@').str[0]
        else:
            saa_poke2[x+'_handle'] = saa_poke2[x]


In [10]:
emails

['home_email_address',
 'bus_email_address',
 'email_switch',
 'saa_email_address',
 'gsb_email_address',
 'other_email_address']

In [11]:
handles = []
for i in range(0,len(emails)):
    handles.append(emails[i]+'_handle')
handles

['home_email_address_handle',
 'bus_email_address_handle',
 'email_switch_handle',
 'saa_email_address_handle',
 'gsb_email_address_handle',
 'other_email_address_handle']

## Function

In [12]:
results_dict = {}
for i in range(0,mailchimp_poke2.shape[1]+1):
    target = mailchimp_poke2.iloc[i]
    target_dict = {'first_name': [target[1]], 'last_name': [target[2]],\
               emails[0]: [target[0]],\
               emails[1]: [target[0]],\
               emails[2]: [target[0]],\
               emails[3]: [target[0]],\
               emails[4]: [target[0]],\
               emails[5]: [target[0]],\
                   handles[0]: [target[4]],\
                   handles[1]: [target[4]],\
                   handles[2]: [target[4]],\
                   handles[3]: [target[4]],\
                   handles[4]: [target[4]],\
                   handles[5]: [target[4]],\
               'home_country': [target[3]]}
    df = pd.DataFrame.from_dict(target_dict)
    subset_saa = saa_poke2[saa_poke2['first_name'] == df.loc[0,'first_name']] 
    #the 0 is calling for row, so it does not return a series
    subset_saa_new = pd.concat([df,subset_saa], axis = 0)
    ohe_df = ohe(subset_saa_new, subset_saa_new.columns)
    ohe_df.drop(columns = subset_saa_new.columns, inplace = True)
    y = np.array(ohe_df.iloc[0])
    y = y.reshape(1,-1)
    cos_sim = cosine_similarity(ohe_df, y)
    cos_sim = pd.DataFrame(data=cos_sim, index=ohe_df.index).sort_values(by=0, ascending=False) #[1:]
    results = list(cos_sim.index)
    results_df = subset_saa_new.loc[results]
    results_dict[i] = results_df

In [13]:
target[4]

'c.electric7518'

In [14]:
subset_saa_new

Unnamed: 0,first_name,last_name,home_email_address,bus_email_address,email_switch,saa_email_address,gsb_email_address,other_email_address,home_email_address_handle,bus_email_address_handle,email_switch_handle,saa_email_address_handle,gsb_email_address_handle,other_email_address_handle,home_country
0,Chansey,Steel,c.electric7518@gmail.com,c.electric7518@gmail.com,c.electric7518@gmail.com,c.electric7518@gmail.com,c.electric7518@gmail.com,c.electric7518@gmail.com,c.electric7518,c.electric7518,c.electric7518,c.electric7518,c.electric7518,c.electric7518,USA
13,Chansey,Electric,*,Not Available,Not Available,c.electric7518@alumni.stanford.edu,Not Available,Not Available,*,Not Available,Not Available,c.electric7518,Not Available,Not Available,Kuwait
259,Chansey,Fairy,*,Not Available,Not Available,c.fairy2795@stanfordalumni.org,Not Available,Not Available,*,Not Available,Not Available,c.fairy2795,Not Available,Not Available,China
618,Chansey,Ghost,c.ghost4860@gmail.com,Not Available,cghost1428@alumni.stanford.edu,*,Not Available,Not Available,c.ghost4860@gmail.com,Not Available,cghost1428,*,Not Available,Not Available,USA
1143,Chansey,Fairy,c.fairy6570@gmail.com,Not Available,Not Available,*,Not Available,Not Available,c.fairy6570@gmail.com,Not Available,Not Available,*,Not Available,Not Available,Not Available
2200,Chansey,Fairy,Not Available,Not Available,Not Available,chanseyf4337@alumni.stanford.edu,Not Available,Not Available,Not Available,Not Available,Not Available,chanseyf4337,Not Available,Not Available,Not Available
2242,Chansey,Rock,*,Not Available,chansey.rock41@stanfordalumni.org,c.rock7818@alumni.stanford.edu,Not Available,Not Available,*,Not Available,chansey.rock41,c.rock7818,Not Available,Not Available,Not Available
2307,Chansey,Fairy,*,Not Available,chansey.fairy8796@stanfordalumni.org,c.fairy755@stanfordalumni.org,Not Available,Not Available,*,Not Available,chansey.fairy8796,c.fairy755,Not Available,Not Available,Japan
2374,Chansey,Fairy,Not Available,Not Available,chansey.fairy7147@alumni.stanford.edu,*,Not Available,Not Available,Not Available,Not Available,chansey.fairy7147,*,Not Available,Not Available,USA
2452,Chansey,Psychic,*,Not Available,chanseypsychic9835@alumni.stanford.edu,*,Not Available,Not Available,*,Not Available,chanseypsychic9835,*,Not Available,Not Available,United States


In [15]:
mailchimp_poke2

Unnamed: 0,Email Address,First Name,Last Name,Country,handle
0,slakoth.normal3945@gmail.com,Slakoth,Normal,USA,slakoth.normal3945
1,e.rock7454@gmail.com,Espeon,Rock,United States,e.rock7454
2,rhydonghost7966@alumni.stanford.edu,Rhydon,Ghost,USA,rhydonghost7966
3,porygong9247@stanfordalumni.org,Porygon,Grass,Japan,porygong9247
4,tangelagrass1376@gmail.com,Tangela,Grass,United States,tangelagrass1376
5,c.electric7518@gmail.com,Chansey,Steel,USA,c.electric7518
6,blissey.ghost4154@gmail.com,Blissey,Ghost,Macao Special Administrative Region of China,blissey.ghost4154


In [16]:
name_first = input('First Name: ')

results_dict[mailchimp_poke2[mailchimp_poke2['First Name']
                             == name_first].index[0]].head(5)

First Name: Chansey


Unnamed: 0,first_name,last_name,home_email_address,bus_email_address,email_switch,saa_email_address,gsb_email_address,other_email_address,home_email_address_handle,bus_email_address_handle,email_switch_handle,saa_email_address_handle,gsb_email_address_handle,other_email_address_handle,home_country
0,Chansey,Steel,c.electric7518@gmail.com,c.electric7518@gmail.com,c.electric7518@gmail.com,c.electric7518@gmail.com,c.electric7518@gmail.com,c.electric7518@gmail.com,c.electric7518,c.electric7518,c.electric7518,c.electric7518,c.electric7518,c.electric7518,USA
13,Chansey,Electric,*,Not Available,Not Available,c.electric7518@alumni.stanford.edu,Not Available,Not Available,*,Not Available,Not Available,c.electric7518,Not Available,Not Available,Kuwait
618,Chansey,Ghost,c.ghost4860@gmail.com,Not Available,cghost1428@alumni.stanford.edu,*,Not Available,Not Available,c.ghost4860@gmail.com,Not Available,cghost1428,*,Not Available,Not Available,USA
2374,Chansey,Fairy,Not Available,Not Available,chansey.fairy7147@alumni.stanford.edu,*,Not Available,Not Available,Not Available,Not Available,chansey.fairy7147,*,Not Available,Not Available,USA
259,Chansey,Fairy,*,Not Available,Not Available,c.fairy2795@stanfordalumni.org,Not Available,Not Available,*,Not Available,Not Available,c.fairy2795,Not Available,Not Available,China


These are the 5 most likely matches.

## Next Steps

Next step would be to take a proactive approach to mitigate the number of emails that bounce in the future. We would suggest using the salesforce dataset to check for recent grads to reach out before they lose their student emails and ask for updated contact information and their subsequent plans after graduation. It would be easier to update records proactively when we still have accurate contact information. For subsequent plans after graduation, this would be to keeping location information of members to more accurately send regional events and functions. 