## Recommender System Algorithm


### Objective
We want to help consumers find attorneys. To surface attorneys to consumers, sales consultants often have to help attorneys describe their areas of practice (areas like Criminal Defense, Business or Personal Injury). 
To expand their practices, attorneys can branch into related areas of practice. This can allow attorneys to help different customers while remaining within the bounds of their experience.

Attached is an anonymized dataset of attorneys and their specialties. The columns are anonymized attorney IDs and specialty IDs. Please design a process that returns the top 5 recommended practice areas for a given attorney with a set of specialties.

## Data

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import normalize

In [2]:
# Import data from the 'data' sheet of the workbook
data = pd.read_excel('data.xlsx', sheet_name='data')

In [10]:
data.shape

(200000, 2)

In [3]:
# View first few rows of the dataset
data.head()

index,attorney_id,specialty_id
0,100000,218
1,100001,263
2,100001,436
3,100001,218
4,100001,481


## Data Exploration

In [4]:
# Information of the dataset
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200000 entries, 0 to 199999
Data columns (total 2 columns):
attorney_id     200000 non-null int64
specialty_id    200000 non-null int64
dtypes: int64(2)
memory usage: 3.1 MB


In [5]:
# Check missing values
data.isnull().sum()

attorney_id     0
specialty_id    0
dtype: int64

In [6]:
# Check duplicates
data.duplicated().sum()

0

In [9]:
# Count the unique values in each of the two ID columns
data['attorney_id'].nunique(), data['specialty_id'].nunique()

(39369, 134)

In [23]:
data['specialty_id'].value_counts()

218    14780
258     9856
429     9608
257     8615
373     8245
713     7417
385     6554
712     6531
255     5988
428     5284
256     4781
433     4439
658     4109
205     3696
208     3608
720     3462
199     3295
436     3275
349     3144
241     3057
263     3015
202     2906
481     2671
668     2525
888     2352
901     2337
885     2253
253     2087
487     1862
723     1847
       ...  
337      214
702      212
437      198
229      187
243      180
267      177
496      175
264      160
471      160
435      144
724      137
910      133
667      132
909      129
654      116
262      113
680       93
109       90
434       90
665       73
729       70
427       69
268       62
448       56
301       43
430       38
679       32
884       29
423       17
192       14
Name: specialty_id, Length: 134, dtype: int64

In [14]:
# Check number of specialties per attorney
data.groupby('attorney_id')['specialty_id'].nunique().sort_values()

attorney_id
100000     1
115127     1
115125     1
148598     1
148606     1
148607     1
115118     1
148618     1
115104     1
148634     1
148639     1
115084     1
148649     1
148653     1
148660     1
115052     1
148667     1
115039     1
148668     1
115035     1
115017     1
148694     1
148697     1
148700     1
115003     1
148714     1
114997     1
114994     1
114979     1
148741     1
          ..
126223    23
107664    23
160740    23
157097    23
146173    23
108045    23
149245    24
105767    24
142656    24
122757    24
150076    24
126432    25
139731    25
122492    25
125866    25
120929    25
139282    25
163602    25
126320    25
162219    25
117051    25
118765    25
127883    25
107802    26
129433    26
149362    26
154524    26
147162    27
165340    28
157715    28
Name: specialty_id, Length: 39369, dtype: int64

The number of specialties of an attorney ranges from 1 to 28.
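The range alone hides how the counts are spread. Because the dataset has no duplicate rows, the mean is simply 200000 rows / 39369 attorneys, or roughly 5.1 specialties per attorney. The sketch below shows one way to look at the full distribution. It uses a small made-up frame (`toy`) in place of the real data, so the IDs and counts in it are illustrative only.

```python
import pandas as pd

# Toy stand-in for the real dataset (IDs are made up for illustration)
toy = pd.DataFrame({
    'attorney_id':  [1, 1, 1, 2, 3, 3],
    'specialty_id': [218, 257, 373, 218, 257, 258],
})

# Number of distinct specialties per attorney
per_attorney = toy.groupby('attorney_id')['specialty_id'].nunique()

# How many attorneys have 1, 2, 3, ... specialties
distribution = per_attorney.value_counts().sort_index()

print(distribution)
print(per_attorney.describe())
```

Swapping `toy` for `data` would give the same summary for the real dataset.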

In [15]:
# View a sample: an attorney with 28 specialties
data[data['attorney_id']==157715]

index,attorney_id,specialty_id
174885,157715,257
174886,157715,361
174887,157715,256
174888,157715,712
174889,157715,205
174890,157715,193
174891,157715,898
174892,157715,723
174893,157715,658
174894,157715,429


## Recommendation System

### Top K Practice Area Recommendations Based on Specialty Similarity

#### Step 1: Build the specialty-attorney matrix

In [16]:
# Build the specialty-attorney matrix
specialty_attorney = data.groupby(['specialty_id','attorney_id'])['attorney_id'].count().unstack(fill_value=0)
specialty_attorney = (specialty_attorney > 0).astype(int)
specialty_attorney

specialty_id \ attorney_id,100000,100001,100002,100003,100005,100008,100010,100011,100012,100013,...,166168,166171,166173,166176,166177,166178,166179,166181,166183,166186
109,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
191,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
192,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
193,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
194,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
196,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
197,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
198,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
199,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,1,0,0,0
200,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
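To make the shape of this matrix concrete, here is a minimal sketch on a made-up six-row frame (`toy`, not the real data). It repeats the groupby/unstack construction and compares it with `pd.crosstab`, which is one alternative way to get the same binary specialty-by-attorney table.

```python
import pandas as pd

# Toy stand-in for the real dataset (IDs are made up for illustration)
toy = pd.DataFrame({
    'attorney_id':  [1, 1, 1, 2, 3, 3],
    'specialty_id': [218, 257, 373, 218, 257, 258],
})

# Same construction as the notebook cell: count pairs, pivot attorneys into columns, binarize
via_groupby = toy.groupby(['specialty_id', 'attorney_id'])['attorney_id'].count().unstack(fill_value=0)
via_groupby = (via_groupby > 0).astype(int)

# Alternative: crosstab counts co-occurrences directly; binarize the same way
via_crosstab = (pd.crosstab(toy['specialty_id'], toy['attorney_id']) > 0).astype(int)

print(via_crosstab)
```

Each row is a specialty and each column an attorney, with 1 marking that the attorney lists that specialty.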


#### Step 2: Build specialty-specialty similarity matrix

In [17]:
# Build specialty-specialty similarity matrix
# L2-normalize each specialty row so that the dot product of two rows is their cosine similarity
specialty_attorney_norm = normalize(specialty_attorney, axis=1)
similarity = np.dot(specialty_attorney_norm, specialty_attorney_norm.T)
df_similarity = pd.DataFrame(similarity, index=specialty_attorney.index, columns=specialty_attorney.index)

df_similarity

specialty_id,109,191,192,193,194,196,197,198,199,200,...,892,894,895,896,897,898,900,901,909,910
109,1.000000,0.006421,0.000000,0.010703,0.000000,0.000000,0.000000,0.023315,0.009182,0.245576,...,0.176335,0.000000,0.063357,0.005659,0.000000,0.009548,0.005326,0.030527,0.000000,0.000000
191,0.006421,1.000000,0.016280,0.018555,0.014162,0.014747,0.011247,0.016168,0.055182,0.020274,...,0.018771,0.014539,0.032952,0.008175,0.022923,0.079083,0.050781,0.045362,0.008045,0.021128
192,0.000000,0.016280,1.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.009312,0.000000,...,0.000000,0.034794,0.006426,0.000000,0.000000,0.000000,0.000000,0.000000,0.023531,0.000000
193,0.010703,0.018555,0.000000,1.000000,0.020656,0.018435,0.034370,0.022458,0.038030,0.019712,...,0.008940,0.024234,0.018308,0.008176,0.010917,0.015328,0.014107,0.014702,0.004470,0.000000
194,0.000000,0.014162,0.000000,0.020656,1.000000,0.021107,0.010732,0.025712,0.031390,0.003224,...,0.005118,0.017657,0.025154,0.003120,0.006250,0.015794,0.033771,0.024047,0.010235,0.005040
196,0.000000,0.014747,0.000000,0.018435,0.021107,1.000000,0.026074,0.026774,0.022142,0.010072,...,0.015986,0.005253,0.030557,0.003249,0.029284,0.021928,0.016818,0.023787,0.000000,0.005248
197,0.000000,0.011247,0.000000,0.034370,0.010732,0.026074,1.000000,0.010890,0.111507,0.000000,...,0.008128,0.040063,0.013317,0.003304,0.016544,0.024157,0.013993,0.014004,0.000000,0.005337
198,0.023315,0.016168,0.000000,0.022458,0.025712,0.026774,0.010890,1.000000,0.102498,0.017177,...,0.023369,0.013438,0.061687,0.007124,0.026160,0.018699,0.024585,0.053075,0.003895,0.011508
199,0.009182,0.055182,0.009312,0.038030,0.031390,0.022142,0.111507,0.102498,1.000000,0.020293,...,0.029910,0.145909,0.064083,0.019639,0.023414,0.062592,0.070414,0.073154,0.013804,0.025680
200,0.245576,0.020274,0.000000,0.019712,0.003224,0.010072,0.000000,0.017177,0.020293,1.000000,...,0.319893,0.016850,0.094688,0.005956,0.002982,0.023447,0.012611,0.053930,0.004884,0.014430
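Normalizing the rows and taking dot products yields cosine similarity. For binary rows this reduces to the number of attorneys listing both specialties divided by the square root of the product of the two specialty sizes. The sketch below checks that equivalence on a small made-up matrix `M`, using NumPy only. The values are illustrative and not taken from the dataset.

```python
import numpy as np

# Rows = specialties, columns = attorneys (toy binary matrix, values made up)
M = np.array([
    [1, 1, 0],   # specialty A: attorneys 1 and 2
    [1, 0, 1],   # specialty B: attorneys 1 and 3
    [0, 0, 1],   # specialty C: attorney 3 only
], dtype=float)

# Route 1: L2-normalize rows, then take dot products (the approach used in the cell above)
row_norms = np.linalg.norm(M, axis=1, keepdims=True)
M_norm = M / row_norms
sim_normalized = M_norm @ M_norm.T

# Route 2: co-occurrence counts divided by sqrt(size_i * size_j)
co_counts = M @ M.T            # co_counts[i, j] = attorneys listing both i and j
sizes = np.diag(co_counts)     # sizes[i] = attorneys listing specialty i
sim_counts = co_counts / np.sqrt(np.outer(sizes, sizes))

print(np.round(sim_normalized, 3))
```

Specialties A and B share one attorney and each has two, so their similarity is 1 / sqrt(2 * 2) = 0.5. A and C share none, so their similarity is 0.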


#### Step 3: Find the Top K most similar specialties

In [18]:
# Find the top k most similar specialties
def topk_specialty(specialty, similarity, k):
    # Drop the specialty itself (self-similarity is 1.0) instead of relying on its sort position
    scores = similarity.loc[specialty].drop(specialty)
    result = scores.sort_values(ascending=False).head(k).reset_index()
    result.columns = ['Specialty_Recommend', 'Similarity']
    return result

### Testing the Recommender System Based on Similarity
#### Process:
1. Ask the user to enter the ID of one of their current specialties
2. The system recommends the top 5 practice areas most similar to that specialty

In [24]:
# Test on a specialty sample 1
user_input1 = int(input('Please input your specialty ID: '))
recommend_user1 = topk_specialty(specialty=user_input1, similarity=df_similarity, k=5)
print('Top 5 recommended practice areas for user 1:')
print('--------------------------------------------')
print(recommend_user1)

Please input your specialty ID: 257
Top 5 recommended practice areas for user 1:
--------------------------------------------
   Specialty_Recommend  Similarity
0                  218    0.361041
1                  373    0.325820
2                  431    0.323956
3                  487    0.295870
4                  713    0.282852


In [25]:
# Test on a specialty sample 2
user_input2 = int(input('Please input your specialty ID: '))
recommend_user2 = topk_specialty(specialty=user_input2, similarity=df_similarity, k=5)
print('Top 5 recommended practice areas for user 2:')
print('--------------------------------------------')
print(recommend_user2)

Please input your specialty ID: 664
Top 5 recommended practice areas for user 2:
--------------------------------------------
   Specialty_Recommend  Similarity
0                  481    0.232267
1                  258    0.192996
2                  218    0.157602
3                  723    0.152190
4                  712    0.140602
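The tests above score one specialty at a time, while the objective asks for recommendations for an attorney with a set of specialties. One possible extension, sketched below, sums the similarity rows of every specialty the attorney already has and then ranks the remaining ones. The helper `recommend_for_attorney`, the summed-score rule, and the toy similarity values are all assumptions made for illustration and are not part of the original notebook.

```python
import pandas as pd

def recommend_for_attorney(specialties, similarity, k=5):
    """Hypothetical helper: rank specialties the attorney lacks by summed similarity."""
    # Keep known specialty IDs only, without repeats
    owned = [s for s in dict.fromkeys(specialties) if s in similarity.index]
    # Sum the similarity rows of every owned specialty into one score per candidate
    scores = similarity.loc[owned].sum(axis=0)
    # Never recommend something the attorney already practices
    scores = scores.drop(labels=owned)
    result = scores.sort_values(ascending=False).head(k).reset_index()
    result.columns = ['Specialty_Recommend', 'Score']
    return result

# Toy similarity matrix (values made up for illustration)
ids = [218, 257, 373, 713]
toy_similarity = pd.DataFrame(
    [[1.0, 0.4, 0.3, 0.1],
     [0.4, 1.0, 0.2, 0.5],
     [0.3, 0.2, 1.0, 0.1],
     [0.1, 0.5, 0.1, 1.0]],
    index=ids, columns=ids,
)

recommended = recommend_for_attorney([218, 257], toy_similarity, k=2)
print(recommended)
```

With the real data, the call would take `df_similarity` and `k=5`. Summing favors candidates that are close to several of the attorney's specialties. Averaging or taking the maximum would be other reasonable choices.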


### Popularity-based Recommendation - If the user requests a recommendation based on popularity

In [28]:
# Rank specialties by popularity: the number of distinct attorneys listing each one
df_specialty_popular = (
    data.groupby('specialty_id')['attorney_id']
    .nunique()
    .sort_values(ascending=False)
    .reset_index(name='count_popular')
)
df_specialty_popular

In [27]:
# Top 5 specialties based on popularity among attorneys
print('The 5 most popular specialties:')
print('--------------------------------')
print(df_specialty_popular.nlargest(5, 'count_popular', keep='all'))