# Stanford Pride Database Matching System
System to alleviate member attrition

Authors:

- Saad Saeed [Github](https://github.com/ssaeed85) | [LinkedIn](https://www.linkedin.com/in/saadsaeed85/)
- Zach Rauch [Github](https://github.com/ZachRauch) | [LinkedIn](https://www.linkedin.com/in/zach-rauch/)
- Hanis Zulmuthi [Github](https://github.com/hanis-z) | [LinkedIn](https://www.linkedin.com/in/hanis-zulmuthi/)
- Xiaohua Su [Github](https://github.com/xiaohua-su) | [LinkedIn](https://www.linkedin.com/in/xiaohua-su/)

# Overview

Nonprofit organizations need to attract new members and retain them. To do so, an organization must stay in touch with its members, who are the foundation of its network, through communications about events and news. When every channel of communication fails, a member loses touch with the organization and its activities and is considered 'lost'. A common cause is that the email address a member gave as their main means of contact stops working, or starts bouncing, once they graduate from the institution that issued it, such as a college or bootcamp. Members often forget to update the address before they leave, so updating contact details is critical to keeping them in the network. Over time, the number of 'lost' members keeps growing.

The purpose of this project is to help Stanford Pride address this issue. Stanford Pride currently has ~5000 members in its database and has lost contact with a small portion of them. One way Stanford Pride recognizes that it has lost contact with a member who has not opted out of the newsletter is that the newsletter bounces. According to Stanford Pride, its members do not all use the same platform: some subscribe only to emails, others are only in the Facebook or LinkedIn group, and a small minority interact with the organization on multiple platforms. Stanford Pride therefore hopes to recover lost members by updating their contact information and bringing them back into the network.

Our goal for this project is to help Stanford Pride update this information more efficiently. We use a cosine similarity model that, for each individual in the Mailchimp database, provides a list of likely matches from the Stanford Pride database. This way, the chair in charge of updating the database no longer needs to look up multiple potential people in the Stanford database before deciding whether they are the same individual; they now have a list of potential matches, with information about each, to compare against.
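
To make the matching idea concrete, here is a minimal sketch of cosine similarity over character trigram counts. The feature choice (trigrams of a concatenated name and email string), the helper names `trigram_counts`, `cosine_similarity`, and `rank_candidates`, and the sample records are all assumptions made for illustration; the model used in this project may vectorize records differently.

```python
import math
from collections import Counter


def trigram_counts(text):
    """Count overlapping character trigrams in a lowercased, padded string."""
    padded = f"  {text.lower().strip()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors (Counters)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_candidates(query, candidates, top_n=3):
    """Return the top_n (score, candidate) pairs, most similar first."""
    query_vec = trigram_counts(query)
    scored = [(cosine_similarity(query_vec, trigram_counts(c)), c) for c in candidates]
    return sorted(scored, reverse=True)[:top_n]


# Hypothetical records: "first last email-local-part" joined into one string.
mailchimp_record = "slakoth normal slakoth.normal3945"
saa_records = [
    "slakoth normal slakoth.normal3945",
    "slakoth fire sfire88",
    "pikachu electric p.electric12",
]
ranked = rank_candidates(mailchimp_record, saa_records)
```

Each entry of `ranked` pairs a score between 0 and 1 with a candidate string, so a reviewer can scan the best few candidates instead of searching the full database.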

From Stanford Pride:
> A nonprofit organization, such as Stanford Pride, strives by attracting and retaining members.
> It is vital for the organization to stay in touch with its members.
> The main means to achieve this is the sending of newsletters via e-mail.
> Members are not likely to keep informed of the organization’s activity on their own. We only stay in their minds by regularly pushing news out to them.
> Members do not always subscribe to other sources of information about the organization’s activities.
> For example, Stanford Pride has approximately 4,400 members in its database, out of which about 3,700 currently have valid e-mail addresses.
> Only 1,600 are part of our Facebook group, and 400 in our LinkedIn group.
> Therefore, our monthly e-mail newsletter is our sole means to reach about 2,100 members – almost half of our total membership.


# Imports

In [1]:
import pandas as pd
import numpy as np
import random 
import pycountry
np.random.seed(42)  # call the function; assigning to it would not seed NumPy
random.seed(42)

pd.set_option('display.max_columns', None)
pd.set_option('display.min_rows', 100)
num_ofDesiredRecords = 4000

# Fake Dataset Creation

Because the real member data is sensitive, we created a fake dataset to work with for this project.

### SAA

In [2]:
def createRandomPhoneNumber_SAA():
    range_start = 10**9
    range_end = 10**10 -1
    num = str(random.randint(range_start,range_end))
    num_str = "{0} {1}-{2}".format(num[:3], num[3:7],num[7:])
    return num_str

In [3]:
def createRandomEmail(record,domainList=['@gmail.com','@stanfordalumni.org','@alumni.stanford.edu']):
    domain = random.choice(domainList)
    fName = record['first_name'].strip().lower()
    LName = record['last_name'].strip().lower()
    formats = []
    
    #Example: John Doe 123
    formats.append(fName[0]+LName+str(random.randint(10,9999))) #jdoe123
    formats.append(fName+LName+str(random.randint(10,9999))) #johndoe123
    formats.append(fName+LName[0]+str(random.randint(10,9999))) #johnd123
    formats.append(fName[0]+'.'+LName+str(random.randint(10,9999))) #j.doe123
    formats.append(fName+'.'+LName+str(random.randint(10,9999))) #john.doe123
    
    return random.choice(formats)+domain

In [4]:
def createDegreeString_SAA():
    degree_list = ['MS','MA','MBA','MD','PhD','BA','BS','JD','']
    degree_years = "'" + str(random.randint(80,99))
    degree = random.choice(degree_list) + ' ' + degree_years
    return degree.strip()

In [5]:
city = ['Chicago', 'Boston',  'Madrid', 'Tokyo', 'Seoul', 'London','Beijing','Shanghai','Dubai','*',np.nan,'']
state = ['NY', 'WA', 'TX','CA','NM','*',np.nan,'']
country = ['Japan', 'United States', 'USA', 'China', 'Kuwait','*',np.nan,'']

In [6]:
common_FNames = [
'Bulbasaur',
'Ivysaur',
'Venusaur',
'Charmander',
'Charmeleon',
'Charizard',
'Squirtle',
'Wartortle',
'Blastoise',
'Caterpie',
'Metapod',
'Butterfree',
'Weedle',
'Kakuna',
'Beedrill',
'Pidgey',
'Pidgeotto',
'Pidgeot',
'Rattata',
'Raticate',
'Spearow',
'Fearow',
'Ekans',
'Arbok',
'Pikachu',
'Raichu',
'Sandshrew',
'Sandslash',
'Nidoran♀',
'Nidorina',
'Nidoqueen',
'Nidoran♂',
'Nidorino',
'Nidoking',
'Clefairy',
'Clefable',
'Vulpix',
'Ninetales',
'Jigglypuff',
'Wigglytuff',
'Zubat',
'Golbat',
'Oddish',
'Gloom',
'Vileplume',
'Paras',
'Parasect',
'Venonat',
'Venomoth',
'Diglett',
'Dugtrio',
'Meowth',
'Persian',
'Psyduck',
'Golduck',
'Mankey',
'Primeape',
'Growlithe',
'Arcanine',
'Poliwag',
'Poliwhirl',
'Poliwrath',
'Abra',
'Kadabra',
'Alakazam',
'Machop',
'Machoke',
'Machamp',
'Bellsprout',
'Weepinbell',
'Victreebel',
'Tentacool',
'Tentacruel',
'Geodude',
'Graveler',
'Golem',
'Ponyta',
'Rapidash',
'Slowpoke',
'Slowbro',
'Magnemite',
'Magneton',
'Doduo',
'Dodrio',
'Seel',
'Dewgong',
'Grimer',
'Muk',
'Shellder',
'Cloyster',
'Gastly',
'Haunter',
'Gengar',
'Onix',
'Drowzee',
'Hypno',
'Krabby',
'Kingler',
'Voltorb',
'Electrode',
'Exeggcute',
'Exeggutor',
'Cubone',
'Marowak',
'Hitmonlee',
'Hitmonchan',
'Lickitung',
'Koffing',
'Weezing',
'Rhyhorn',
'Rhydon',
'Chansey',
'Tangela',
'Kangaskhan',
'Horsea',
'Seadra',
'Goldeen',
'Seaking',
'Staryu',
'Starmie',
'Mr. Mime',
'Scyther',
'Jynx',
'Electabuzz',
'Magmar',
'Pinsir',
'Tauros',
'Magikarp',
'Gyarados',
'Lapras',
'Ditto',
'Eevee',
'Vaporeon',
'Jolteon',
'Flareon',
'Porygon',
'Omanyte',
'Omastar',
'Kabuto',
'Kabutops',
'Aerodactyl',
'Snorlax',
'Articuno',
'Zapdos',
'Moltres',
'Dratini',
'Dragonair',
'Dragonite',
'Mewtwo',
'Mew',
'Chikorita',
'Bayleef',
'Meganium',
'Cyndaquil',
'Quilava',
'Typhlosion',
'Totodile',
'Croconaw',
'Feraligatr',
'Sentret',
'Furret',
'Hoothoot',
'Noctowl',
'Ledyba',
'Ledian',
'Spinarak',
'Ariados',
'Crobat',
'Chinchou',
'Lanturn',
'Pichu',
'Cleffa',
'Igglybuff',
'Togepi',
'Togetic',
'Natu',
'Xatu',
'Mareep',
'Flaaffy',
'Ampharos',
'Bellossom',
'Marill',
'Azumarill',
'Sudowoodo',
'Politoed',
'Hoppip',
'Skiploom',
'Jumpluff',
'Aipom',
'Sunkern',
'Sunflora',
'Yanma',
'Wooper',
'Quagsire',
'Espeon',
'Umbreon',
'Murkrow',
'Slowking',
'Misdreavus',
'Unown',
'Wobbuffet',
'Girafarig',
'Pineco',
'Forretress',
'Dunsparce',
'Gligar',
'Steelix',
'Snubbull',
'Granbull',
'Qwilfish',
'Scizor',
'Shuckle',
'Heracross',
'Sneasel',
'Teddiursa',
'Ursaring',
'Slugma',
'Magcargo',
'Swinub',
'Piloswine',
'Corsola',
'Remoraid',
'Octillery',
'Delibird',
'Mantine',
'Skarmory',
'Houndour',
'Houndoom',
'Kingdra',
'Phanpy',
'Donphan',
'Porygon2',
'Stantler',
'Smeargle',
'Tyrogue',
'Hitmontop',
'Smoochum',
'Elekid',
'Magby',
'Miltank',
'Blissey',
'Raikou',
'Entei',
'Suicune',
'Larvitar',
'Pupitar',
'Tyranitar',
'Lugia',
'Ho-Oh',
'Celebi',
'Treecko',
'Grovyle',
'Sceptile',
'Torchic',
'Combusken',
'Blaziken',
'Mudkip',
'Marshtomp',
'Swampert',
'Poochyena',
'Mightyena',
'Zigzagoon',
'Linoone',
'Wurmple',
'Silcoon',
'Beautifly',
'Cascoon',
'Dustox',
'Lotad',
'Lombre',
'Ludicolo',
'Seedot',
'Nuzleaf',
'Shiftry',
'Taillow',
'Swellow',
'Wingull',
'Pelipper',
'Ralts',
'Kirlia',
'Gardevoir',
'Surskit',
'Masquerain',
'Shroomish',
'Breloom',
'Slakoth',
'Vigoroth',
'Slaking',
'Nincada',
'Ninjask',
'Shedinja',
'Whismur',
'Loudred',
'Exploud',
'Makuhita',
'Hariyama',
'Azurill',
'Nosepass',   
]

In [7]:
common_LNames = ['Normal',
'Fire',
'Water',
'Grass',
'Electric',
'Ice',
'Fighting',
'Poison',
'Ground',
'Flying',
'Psychic',
'Bug',
'Rock',
'Ghost',
'Dragon',
'Steel',
'Dark',
'Fairy']

In [8]:
# DataFrame.append was removed in pandas 2.0; pd.concat builds the same frame,
# including the placeholder column 0 that is dropped before export.
df_SAA = pd.concat([pd.read_excel("./data/SAA Pride member reports headings.xlsx"),
                    pd.DataFrame([np.nan]*num_ofDesiredRecords)])


In [9]:
__  = []
for _ in range(0,num_ofDesiredRecords):
    __.append(random.choice(common_FNames))
    
df_SAA.first_name = __

In [10]:
__  = []
for _ in range(0,num_ofDesiredRecords):
    __.append(random.choice(common_LNames))
    
df_SAA.last_name = __

In [11]:
__  = []
for _ in range(0,num_ofDesiredRecords):
    __.append(random.choices([createRandomPhoneNumber_SAA(),'*',np.nan],weights=(1,5,5))[0])
    
df_SAA.home_phone_number = __

In [12]:
__  = []
for _ in range(0,num_ofDesiredRecords):
    __.append(random.choice(['*',np.nan,createRandomEmail(df_SAA.iloc[_])]))
    
df_SAA.home_email_address = __


__  = []
for _ in range(0,num_ofDesiredRecords):
    __.append(random.choice([createRandomEmail(df_SAA.iloc[_]),np.nan]))
    
df_SAA.email_switch = __

In [13]:
__  = []
for _ in range(0,num_ofDesiredRecords):
    __.append(random.choice(['*',np.nan,createRandomEmail(df_SAA.iloc[_],['@stanfordalumni.org','@alumni.stanford.edu'])]))
    
df_SAA.saa_email_address = __

In [14]:
__  = []
for _ in range(0,num_ofDesiredRecords):
    __.append(random.choices([createRandomEmail(df_SAA.iloc[_]),np.nan],weights=(1, 50))[0])
    
df_SAA.gsb_email_address = __

In [15]:
__  = []
for _ in range(0,num_ofDesiredRecords):
    __.append(random.choices([createRandomEmail(df_SAA.iloc[_]),np.nan],weights=(1, 10))[0])
    
df_SAA.bus_email_address = __

In [16]:
__  = []
for _ in range(0,num_ofDesiredRecords):
    __.append(random.choice(city))
    
df_SAA.home_city = __

In [17]:
__  = []
for _ in range(0,num_ofDesiredRecords):
    __.append(random.choice(country))
    
df_SAA.home_country = __

In [18]:
__  = []
for _ in range(0,num_ofDesiredRecords):
    __.append(random.choice(country))
    
df_SAA.bus_country = __
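
The column-filling cells above all repeat the same append-in-a-loop pattern. As a rough sketch, the pattern could be folded into one small helper. The name `fill_column` and the shortened value lists are hypothetical stand-ins, not part of the notebook.

```python
import random


def fill_column(n_rows, make_value):
    """Build a list of n_rows values by calling make_value(row_index) once per row."""
    return [make_value(i) for i in range(n_rows)]


random.seed(42)
last_names = ['Normal', 'Fire', 'Water']  # shortened stand-in for common_LNames
countries = ['Japan', 'USA', '*', '']     # shortened stand-in for country

# One line per column instead of one loop per column.
last_name_col = fill_column(5, lambda i: random.choice(last_names))
country_col = fill_column(5, lambda i: random.choice(countries))
# Weighted draw: '*' is five times as likely as the placeholder 'phone'.
phone_col = fill_column(5, lambda i: random.choices(['phone', '*'], weights=(1, 5))[0])
```

Passing the row index to `make_value` keeps the helper usable for cells that need the current row, such as the ones that build an email from the row's first and last name.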

In [19]:
state = {'Chicago': 'IL', 
           'Boston': 'MA', 
           'New York' : 'NY', 
           'San Francisco': 'CA', 
           'Los Angeles' : 'CA', 
           'Austin' : 'TX',
        'Dallas': 'TX',
        'Denver': 'CO',
        '':'',
        '*':'*'}

df_SAA.home_state_code = df_SAA['home_city'].map(state)
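
Note that `Series.map` with a dictionary returns NaN for any value that is not a key, which is why a record with `home_city` set to Seoul ends up with a NaN `home_state_code`. The short sketch below shows that behavior on a shortened, hypothetical copy of the lookup (`state_lookup`) and a made-up list of cities.

```python
import numpy as np
import pandas as pd

# Shortened stand-in for the city-to-state dictionary above.
state_lookup = {'Chicago': 'IL', 'Boston': 'MA', '': '', '*': '*'}
cities = pd.Series(['Chicago', 'Seoul', '*', np.nan])

# Keys found in the dictionary are translated; everything else becomes NaN.
codes = cities.map(state_lookup)

# An explicit fallback can be applied afterwards if NaN is not wanted.
codes_filled = codes.fillna('')
```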

In [20]:
__  = []
for _ in range(0,num_ofDesiredRecords):
    __.append(random.choice(['',np.nan,random.randint(1990, 2018)]))
    
df_SAA.pref_class_year = __


In [21]:
degrees = []
for i in range(0,num_ofDesiredRecords):
    num_draw = random.randint(0,3)
    degree = np.nan  # np.NaN alias was removed in NumPy 2.0
    k=0
    while k < num_draw:
        if k == 0:
            degree = createDegreeString_SAA()
        else:
            degree = degree +', '+ createDegreeString_SAA()
        k+=1
    degrees.append(degree)
    
df_SAA.short_degree_string = degrees
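
The degree-string loop above can also be expressed with `', '.join`. The sketch below is one possible equivalent, with hypothetical helper names (`make_degree`, `make_degree_string`); it returns `None` where the notebook stores NaN so that it runs without NumPy.

```python
import random

DEGREES = ['MS', 'MA', 'MBA', 'MD', 'PhD', 'BA', 'BS', 'JD', '']


def make_degree():
    """One degree label plus a two-digit year, e.g. "MS '85"; the label may be empty."""
    return (random.choice(DEGREES) + " '" + str(random.randint(80, 99))).strip()


def make_degree_string(max_degrees=3):
    """Join 0..max_degrees degrees with ', '; return None when no degree is drawn."""
    count = random.randint(0, max_degrees)
    if count == 0:
        return None  # the notebook stores NaN here
    return ', '.join(make_degree() for _ in range(count))


random.seed(42)
samples = [make_degree_string() for _ in range(5)]
```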

In [22]:
# df_SAA.to_excel('data/SAA_FakeDB.xlsx',index = False)

In [23]:
df_SAA.drop(columns=[0]).to_excel('data/SAA_Pokemon_FakeDB.xlsx',index = False)

In [24]:
df_SAA.loc[74]

pref_mail_name                                                    NaN
pref_class_year                                                      
home_city                                                       Seoul
home_state_code                                                   NaN
home_country                                                      USA
home_phone_area_code                                              NaN
home_phone_number                                                   *
home_email_address             slakoth.normal3945@alumni.stanford.edu
bus_city                                                          NaN
bus_state_code                                                    NaN
bus_country                                                     Japan
bus_phone_area_code                                               NaN
bus_phone_number                                                  NaN
bus_email_address                                                 NaN
first_name          

### Mailchimp

Creating Mailchimp records that match some of the Stanford records to varying degrees.

In [25]:
df_mc = pd.read_csv('data/MailChimp cleaned records headers.csv')
df_mc

Unnamed: 0,Email Address,First Name,Last Name,Board Member,Gender,Chapter,Reunion Year,Country,Degree,MEMBER_RATING,OPTIN_TIME,OPTIN_IP,CONFIRM_TIME,CONFIRM_IP,LATITUDE,LONGITUDE,GMTOFF,DSTOFF,TIMEZONE,CC,REGION,CLEAN_TIME,CLEAN_CAMPAIGN_TITLE,CLEAN_CAMPAIGN_ID,LEID,EUID,NOTES,TAGS


In [26]:
# DataFrame.append was removed in pandas 2.0; pd.concat adds the same five empty rows.
df_mc = pd.concat([df_mc, pd.DataFrame([np.nan]*5)]).drop(columns=0)


In [27]:
# First record. No major difference other than email
rec = df_SAA.loc[74]
i = 0
df_mc.loc[i,'First Name'] = rec.first_name
df_mc.loc[i,'Last Name'] = rec.last_name
df_mc.loc[i,'Email Address'] = rec.home_email_address.split('@')[0]+'@gmail.com'
df_mc.loc[i,'Country'] = rec.home_country
df_mc.loc[i,'Degree'] = random.choice([np.nan,''])

df_mc.loc[i,'Board Member'] = random.choice([True,False])
df_mc.loc[i,'Gender'] = random.choice(['F','M',np.nan])
df_mc.loc[i,'Chapter'] = random.choice(['Other US','Texas','Bay Area','DC Area','New England'])

In [28]:
# Second record. USA == United States. Different email
rec = df_SAA.loc[34]
i=1

df_mc.loc[i,'First Name'] = rec.first_name
df_mc.loc[i,'Last Name'] = rec.last_name
df_mc.loc[i,'Email Address'] = createRandomEmail(rec)
df_mc.loc[i,'Country'] = 'United States'
# df_mc.loc[i,'Degree'] = random.choice([np.nan,''])

df_mc.loc[i,'Board Member'] = random.choice([True,False])
df_mc.loc[i,'Gender'] = random.choice(['F','M',np.nan])
df_mc.loc[i,'Chapter'] = random.choice(['Other US','Texas','Bay Area','DC Area','New England'])

# Third record. Missing 1 degree. Different country. Different email
rec = df_SAA.loc[30]
i=2
df_mc.loc[i,'First Name'] = rec.first_name
df_mc.loc[i,'Last Name'] = rec.last_name
df_mc.loc[i,'Email Address'] = 'rhydonghost7966@alumni.stanford.edu'
df_mc.loc[i,'Country'] = 'USA'
df_mc.loc[i,'Degree'] = 'MBA'

df_mc.loc[i,'Board Member'] = random.choice([True,False])
df_mc.loc[i,'Gender'] = random.choice(['F','M',np.nan])
df_mc.loc[i,'Chapter'] = random.choice(['Other US','Texas','Bay Area','DC Area','New England'])

In [29]:
# 4th record. Has a degree on mail chimp side
rec = df_SAA.loc[92]
i=3
df_mc.loc[i,'First Name'] = rec.first_name
df_mc.loc[i,'Last Name'] = rec.last_name
df_mc.loc[i,'Email Address'] = createRandomEmail(rec)
df_mc.loc[i,'Country'] = 'Japan'
df_mc.loc[i,'Degree'] = 'MS'

df_mc.loc[i,'Board Member'] = random.choice([True,False])
df_mc.loc[i,'Gender'] = random.choice(['F','M',np.nan])
df_mc.loc[i,'Chapter'] = random.choice(['Other US','Texas','Bay Area','DC Area','New England'])

In [30]:
# 5th record. Missing all degrees
rec = df_SAA.loc[101]
i=4

df_mc.loc[i,'First Name'] = rec.first_name
df_mc.loc[i,'Last Name'] = rec.last_name
df_mc.loc[i,'Email Address'] = createRandomEmail(rec)
df_mc.loc[i,'Country'] = 'United States'
df_mc.loc[i,'Degree'] = random.choice([np.nan,''])

df_mc.loc[i,'Board Member'] = random.choice([True,False])
df_mc.loc[i,'Gender'] = random.choice(['F','M',np.nan])
df_mc.loc[i,'Chapter'] = random.choice(['Other US','Texas','Bay Area','DC Area','New England'])

In [31]:
# 6th record. Changed last name
rec = df_SAA.loc[13]
i=5
df_mc.loc[i,'First Name'] = rec.first_name
df_mc.loc[i,'Last Name'] = random.choice(common_LNames)
df_mc.loc[i,'Email Address'] = rec.saa_email_address.split('@')[0]+'@gmail.com'
df_mc.loc[i,'Country'] = 'USA'
df_mc.loc[i,'Degree'] = random.choice([np.nan,''])

df_mc.loc[i,'Board Member'] = random.choice([True,False])
df_mc.loc[i,'Gender'] = random.choice(['F','M',np.nan])
df_mc.loc[i,'Chapter'] = random.choice(['Other US','Texas','Bay Area','DC Area','New England'])

In [32]:
# 7th record. mostly empty mail chimp record
random_country = list(pycountry.countries)
random.shuffle(random_country)

rec = df_SAA.loc[600]
i=6
df_mc.loc[i,'First Name'] = rec.first_name
df_mc.loc[i,'Last Name'] = rec.last_name
df_mc.loc[i,'Email Address'] = rec.email_switch.split('@')[0]+'@gmail.com'
df_mc.loc[i,'Country'] = random_country[0].official_name
df_mc.loc[i,'Degree'] = random.choice([np.nan,''])

df_mc.loc[i,'Board Member'] = random.choice([True,False])
df_mc.loc[i,'Gender'] = random.choice(['F','M',np.nan])
df_mc.loc[i,'Chapter'] = random.choice(['Other US','Texas','Bay Area','DC Area','New England'])
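
The seven record cells above share most of their lines and differ only in a few fields. A possible way to express that, sketched here with a hypothetical helper `make_mailchimp_record` and a made-up SAA row, is to build the common fields once and pass the differences as overrides.

```python
import random

CHAPTERS = ['Other US', 'Texas', 'Bay Area', 'DC Area', 'New England']


def make_mailchimp_record(saa_record, **overrides):
    """Copy the name fields from an SAA record, add random filler, apply overrides."""
    record = {
        'First Name': saa_record['first_name'],
        'Last Name': saa_record['last_name'],
        'Country': saa_record.get('home_country'),
        'Degree': None,
        'Board Member': random.choice([True, False]),
        'Gender': random.choice(['F', 'M', None]),
        'Chapter': random.choice(CHAPTERS),
    }
    record.update(overrides)
    return record


# Hypothetical SAA record, shaped like one row of df_SAA.
saa = {'first_name': 'Rhydon', 'last_name': 'Ghost', 'home_country': 'Japan'}

# Mirrors the third record: different country, a single degree, a fixed email.
mc = make_mailchimp_record(
    saa,
    **{'Email Address': 'rhydonghost7966@alumni.stanford.edu',
       'Country': 'USA',
       'Degree': 'MBA'},
)
```

The returned dictionaries could then be collected into a list and turned into a DataFrame, which would keep each test case down to the handful of fields that make it interesting.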

In [33]:
df_mc

Unnamed: 0,Email Address,First Name,Last Name,Board Member,Gender,Chapter,Reunion Year,Country,Degree,MEMBER_RATING,OPTIN_TIME,OPTIN_IP,CONFIRM_TIME,CONFIRM_IP,LATITUDE,LONGITUDE,GMTOFF,DSTOFF,TIMEZONE,CC,REGION,CLEAN_TIME,CLEAN_CAMPAIGN_TITLE,CLEAN_CAMPAIGN_ID,LEID,EUID,NOTES,TAGS
0,slakoth.normal3945@gmail.com,Slakoth,Normal,False,,Texas,,USA,,,,,,,,,,,,,,,,,,,,
1,e.rock7454@gmail.com,Espeon,Rock,True,F,DC Area,,United States,,,,,,,,,,,,,,,,,,,,
2,rhydonghost7966@alumni.stanford.edu,Rhydon,Ghost,False,M,Bay Area,,USA,MBA,,,,,,,,,,,,,,,,,,,
3,porygong9247@stanfordalumni.org,Porygon,Grass,False,M,Bay Area,,Japan,MS,,,,,,,,,,,,,,,,,,,
4,tangelagrass1376@gmail.com,Tangela,Grass,False,,New England,,United States,,,,,,,,,,,,,,,,,,,,
5,c.electric7518@gmail.com,Chansey,Steel,True,F,Other US,,USA,,,,,,,,,,,,,,,,,,,,
6,blissey.ghost4154@gmail.com,Blissey,Ghost,False,M,New England,,Macao Special Administrative Region of China,,,,,,,,,,,,,,,,,,,,


In [34]:
df_mc.to_csv('data/Fake_MailChimp_cleaned_Pokemon.csv',index=False)