## Initial Data Cleaning

- Dropped the 'education' and 'sign' columns
- Filled missing values with a default category per column
- Filtered rows
    - Status, age, location

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [43]:
# read in data from pickle file

cupid_df = pd.read_pickle('data/subset_cupid.pkl')

In [18]:
cupid_df.head()

Unnamed: 0,age,status,sex,orientation,body_type,diet,drinks,drugs,education,location,offspring,pets,religion,sign,smokes
0,22,single,m,straight,a little extra,strictly anything,socially,never,working on college/university,"south san francisco, california","doesn't have kids, but might want them",likes dogs and likes cats,agnosticism and very serious about it,gemini,sometimes
1,35,single,m,straight,average,mostly other,often,sometimes,working on space camp,"oakland, california","doesn't have kids, but might want them",likes dogs and likes cats,agnosticism but not too serious about it,cancer,no
2,38,available,m,straight,thin,anything,socially,,graduated from masters program,"san francisco, california",,has cats,,pisces but it doesn't matter,no
3,23,single,m,straight,thin,vegetarian,socially,,working on college/university,"berkeley, california",doesn't want kids,likes cats,,pisces,no
4,29,single,m,straight,athletic,,socially,never,graduated from college/university,"san francisco, california",,likes dogs and likes cats,,aquarius,no


#### Further filtering data
- Dropping the 'education' and 'sign' features: I felt the 'education' values were odd (for example, 'working on space camp') and that 'sign' is not indicative when recommending a partner
- Keeping only rows where 'status' is 'single' or 'available': we're recommending partners, and users who are 'married' or 'seeing someone' already have a person of interest, assuming relationships aren't polyamorous. The 10 rows with status 'unknown' are dropped by the same filter
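
The two steps listed above can be sketched on a small stand-in frame. This is a minimal illustration, not the notebook's exact code: the `toy` frame, its values, and the names `trimmed`, `keep_statuses`, and `eligible` are made up for the example.

```python
import pandas as pd

# Toy frame standing in for the profile data (values are made up).
toy = pd.DataFrame({
    "age": [22, 35, 38, 41],
    "status": ["single", "seeing someone", "available", "married"],
    "education": ["a", "b", "c", "d"],
    "sign": ["gemini", "cancer", "pisces", "leo"],
})

# Drop the two columns without mutating the original frame.
trimmed = toy.drop(columns=["education", "sign"])

# Keep only rows whose status is in the allowed set. The explicit .copy()
# makes the result an independent frame rather than a slice of `trimmed`.
keep_statuses = ["single", "available"]
eligible = trimmed[trimmed["status"].isin(keep_statuses)].copy()

print(eligible)
```

Returning a new frame from `drop` instead of using `inplace=True` keeps the original available for comparison, which can be handy when re-running cells out of order.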

In [44]:
# drop 'education' and 'sign' (in my opinion, not useful for recommending a partner)

cupid_df.drop(columns=['education', 'sign'], inplace=True)

In [55]:
cupid_df.shape

(59946, 13)

In [56]:
cupid_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59946 entries, 0 to 59945
Data columns (total 13 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   age          59946 non-null  category
 1   status       59946 non-null  category
 2   sex          59946 non-null  category
 3   orientation  59946 non-null  category
 4   body_type    54650 non-null  category
 5   diet         35551 non-null  category
 6   drinks       56961 non-null  category
 7   drugs        45866 non-null  category
 8   location     59946 non-null  category
 9   offspring    24385 non-null  category
 10  pets         40025 non-null  category
 11  religion     39720 non-null  category
 12  smokes       54434 non-null  category
dtypes: category(13)
memory usage: 836.6 KB


In [20]:
cupid_df['status'].value_counts()

single            55697
seeing someone     2064
available          1865
married             310
unknown              10
Name: status, dtype: int64

In [45]:
# keep only rows whose status is 'single' or 'available'

cupid = cupid_df[cupid_df['status'].isin(['single', 'available'])]

In [22]:
# dropped 2,384 rows (59,946 -> 57,562)

cupid.shape

(57562, 13)

#### Null Values

In [23]:
# check for nulls

cupid.isna().sum()

age                0
status             0
sex                0
orientation        0
body_type       4867
diet           23136
drinks          2918
drugs          13508
location           0
offspring      33881
pets           19384
religion       19656
smokes          5361
dtype: int64
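
Raw counts are easier to compare as a share of rows: 'offspring', for instance, is missing in 33,881 of 57,562 rows, roughly 59%. A minimal sketch of a combined count-and-percentage summary follows; the `toy` frame and the name `null_summary` are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy frame with a few missing values (made up for illustration).
toy = pd.DataFrame({
    "diet": ["anything", np.nan, np.nan, "vegetarian"],
    "smokes": ["no", "no", np.nan, "sometimes"],
})

# isna().sum() counts nulls per column; isna().mean() gives the fraction.
null_summary = pd.DataFrame({
    "n_missing": toy.isna().sum(),
    "pct_missing": (toy.isna().mean() * 100).round(1),
}).sort_values("pct_missing", ascending=False)

print(null_summary)
```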

In [46]:
# impute missing values: each column gets one fixed default category

cupid['body_type'].fillna('rather not say', inplace=True)
cupid['diet'].fillna('anything', inplace=True)
cupid['drinks'].fillna('not at all', inplace=True)
cupid['drugs'].fillna('never', inplace=True)
cupid['offspring'].fillna("doesn't have kids", inplace=True)
cupid['pets'].fillna('dislikes dogs and dislikes cats', inplace=True)
cupid['religion'].fillna('atheism', inplace=True)
cupid['smokes'].fillna('no', inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  downcast=downcast,
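
The warning appears because `cupid` was created by filtering `cupid_df`, so pandas cannot tell whether the in-place fill is meant to reach the original frame. One way to avoid it, sketched here under assumptions, is to take an explicit copy after filtering and assign the filled column back. The `toy` frame and the names `subset` and `fill_values` are hypothetical. The `add_categories` guard is only needed when a fill value is not already one of the column's categories.

```python
import numpy as np
import pandas as pd

# Toy categorical frame (values are made up for illustration).
toy = pd.DataFrame({
    "drinks": pd.Categorical(["socially", np.nan, "often"]),
    "smokes": pd.Categorical([np.nan, "no", "sometimes"]),
})

# Filter, then copy, so later assignments target an independent frame.
subset = toy[toy["drinks"] != "often"].copy()

# One default per column, kept in a single mapping.
fill_values = {"drinks": "not at all", "smokes": "no"}

for column, value in fill_values.items():
    # A categorical column only accepts fill values that are categories.
    if value not in subset[column].cat.categories:
        subset[column] = subset[column].cat.add_categories([value])
    # Assign the result back instead of calling fillna(..., inplace=True).
    subset[column] = subset[column].fillna(value)

print(subset)
```

Keeping the defaults in one dictionary also makes the imputation choices easy to review in one place.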


In [25]:
# confirm that no null values remain

cupid.isna().sum()

age            0
status         0
sex            0
orientation    0
body_type      0
diet           0
drinks         0
drugs          0
location       0
offspring      0
pets           0
religion       0
smokes         0
dtype: int64

In [15]:
# inspect the age distribution before filtering
# two implausible entries stand out: ages 109 and 110

cupid['age'].value_counts()

26     3537
27     3518
28     3396
25     3393
29     3147
24     3106
30     3012
31     2619
23     2463
32     2446
33     2115
22     1848
34     1826
35     1683
36     1530
37     1381
38     1289
21     1217
39     1139
42     1049
40     1002
41      954
20      915
43      828
44      683
45      631
19      593
46      568
47      514
48      471
49      450
50      421
51      342
52      339
18      298
56      268
54      261
55      260
57      253
53      246
59      218
58      192
60      189
61      172
62      166
63      135
64      112
65      107
66      103
67       66
68       58
69       31
109       1
110       1
Name: age, dtype: int64
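
Because `value_counts()` sorts by frequency, the outliers end up at the bottom of a long listing. Sorting the counts by age instead puts the extremes at the ends. A small sketch with made-up ages (the names `ages`, `counts_by_age`, and `oldest` are illustrative):

```python
import pandas as pd

# Made-up ages, including two implausible values.
ages = pd.Series([26, 26, 27, 18, 69, 109, 110])

# Sort the counts by age (the index) rather than by frequency.
counts_by_age = ages.value_counts().sort_index()

# The last rows now hold the oldest reported ages.
oldest = counts_by_age.tail(2)
print(oldest)
```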

In [47]:
# convert 'age' from category to int32 so it can be compared numerically
cupid['age'] = cupid['age'].astype('int32')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  


In [48]:
# drop the two implausible ages (109 and 110)
cupid = cupid[cupid['age'] < 109]
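
The conversion and the age filter can also be written with explicit bounds, which documents the intended range rather than the specific outliers. This is a sketch only: `MIN_AGE` and `MAX_AGE` are hypothetical thresholds chosen for the example, and `toy` is a made-up frame.

```python
import pandas as pd

# Toy frame whose 'age' column is categorical, as in the loaded data.
toy = pd.DataFrame({"age": pd.Series([22, 35, 109, 110], dtype="category")})

# Hypothetical plausibility bounds (assumptions, not taken from the data).
MIN_AGE, MAX_AGE = 18, 100

# assign() returns a new frame, so no slice is modified in place.
cleaned = toy.assign(age=toy["age"].astype("int32"))

# between() is inclusive on both ends by default.
cleaned = cleaned[cleaned["age"].between(MIN_AGE, MAX_AGE)]

print(cleaned)
```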

In [49]:
# keep only rows whose location is in California
cupid = cupid[cupid['location'].str.contains('california')]
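
`str.contains('california')` matches the substring anywhere in the string, so it would also keep a city whose name happens to contain it. A stricter variant, sketched below, splits off the state and compares it exactly. The sample locations are made up for illustration (the last one is included only to show the difference), and the names `parts` and `in_california` are hypothetical.

```python
import pandas as pd

# Made-up locations in the same "city, state" format.
locations = pd.Series([
    "san francisco, california",
    "oakland, california",
    "new york, new york",
    "california city, nevada",
])

# Split once from the right so the last piece is the state.
parts = locations.str.rsplit(", ", n=1, expand=True)
parts.columns = ["city", "state"]

# Exact match on the state, rather than a substring match on the whole string.
in_california = parts["state"] == "california"

print(locations[in_california])
```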

In [80]:
cupid['location'].value_counts()

san francisco, california              29918
oakland, california                     6886
berkeley, california                    3979
san mateo, california                   1291
palo alto, california                   1013
alameda, california                      868
san rafael, california                   733
hayward, california                      710
emeryville, california                   706
daly city, california                    663
redwood city, california                 654
san leandro, california                  620
walnut creek, california                 618
vallejo, california                      537
menlo park, california                   452
south san francisco, california          405
richmond, california                     399
mountain view, california                365
novato, california                       361
burlingame, california                   348
pleasant hill, california                333
castro valley, california                333
stanford, 

In [29]:
cupid.shape

(57473, 13)

In [30]:
cupid.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 57473 entries, 0 to 59945
Data columns (total 13 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   age          57473 non-null  int32   
 1   status       57473 non-null  category
 2   sex          57473 non-null  category
 3   orientation  57473 non-null  category
 4   body_type    57473 non-null  category
 5   diet         57473 non-null  category
 6   drinks       57473 non-null  category
 7   drugs        57473 non-null  category
 8   location     57473 non-null  category
 9   offspring    57473 non-null  category
 10  pets         57473 non-null  category
 11  religion     57473 non-null  category
 12  smokes       57473 non-null  category
dtypes: category(12), int32(1)
memory usage: 1.4 MB


In [36]:
cupid.to_pickle('data/clean_cupid.pkl')

In [37]:
cupid_check = pd.read_pickle('data/clean_cupid.pkl')

In [38]:
cupid_check.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 57473 entries, 0 to 59945
Data columns (total 13 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   age          57473 non-null  int32   
 1   status       57473 non-null  category
 2   sex          57473 non-null  category
 3   orientation  57473 non-null  category
 4   body_type    57473 non-null  category
 5   diet         57473 non-null  category
 6   drinks       57473 non-null  category
 7   drugs        57473 non-null  category
 8   location     57473 non-null  category
 9   offspring    57473 non-null  category
 10  pets         57473 non-null  category
 11  religion     57473 non-null  category
 12  smokes       57473 non-null  category
dtypes: category(12), int32(1)
memory usage: 1.4 MB
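
Comparing the two `info()` outputs by eye confirms the shape and dtypes. A programmatic round-trip check is sketched below on a toy frame written to a temporary directory; the frame and the file name `clean_toy.pkl` are illustrative, not the notebook's data.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame with the same dtypes as the cleaned data (int32 + category).
toy = pd.DataFrame({
    "age": pd.Series([22, 35], dtype="int32"),
    "status": pd.Categorical(["single", "available"]),
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "clean_toy.pkl"
    toy.to_pickle(path)
    reloaded = pd.read_pickle(path)

# Raises AssertionError if values, dtypes, or the index differ.
pd.testing.assert_frame_equal(toy, reloaded)
```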
