## Recommender System Algorithm


### Objective
We want to help consumers find attorneys. To surface attorneys to consumers, sales consultants often have to help attorneys describe their areas of practice (for example, Criminal Defense, Business, or Personal Injury).
Attorneys can expand their practices by branching into related areas. This lets them serve different customers while staying within the bounds of their experience.

Attached is an anonymized dataset of attorneys and their specialties. The columns are anonymized attorney IDs and specialty IDs. Please design a process that returns the top 5 recommended practice areas for a given attorney with a set of specialties.

## Data

In [22]:
import pandas as pd
from sklearn.preprocessing import normalize
import numpy as np

In [9]:
# Import data
data = pd.read_excel('data.xlsx', 'data')

In [10]:
# View first few rows of the dataset
data.head()

Unnamed: 0,attorney_id,specialty_id
0,100000,218
1,100001,263
2,100001,436
3,100001,218
4,100001,481


## Data Exploration

In [11]:
# Information of the dataset
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200000 entries, 0 to 199999
Data columns (total 2 columns):
attorney_id     200000 non-null int64
specialty_id    200000 non-null int64
dtypes: int64(2)
memory usage: 3.1 MB


In [12]:
# Check missing values
data.isnull().sum()

attorney_id     0
specialty_id    0
dtype: int64

In [13]:
# Check duplicates
data.duplicated().sum()

0

In [15]:
# Count unique values in the two ID columns
data['attorney_id'].nunique(), data['specialty_id'].nunique()

(39369, 134)

In [16]:
# Check number of specialties per attorney
data.groupby('attorney_id')['specialty_id'].nunique().sort_values()

attorney_id
100000     1
115127     1
115125     1
148598     1
148606     1
          ..
149362    26
154524    26
147162    27
165340    28
157715    28
Name: specialty_id, Length: 39369, dtype: int64

The number of specialties of an attorney ranges from 1 to 28.

In [17]:
# View a sample: an attorney with 28 specialties
data[data['attorney_id']==157715]

Unnamed: 0,attorney_id,specialty_id
174885,157715,257
174886,157715,361
174887,157715,256
174888,157715,712
174889,157715,205
174890,157715,193
174891,157715,898
174892,157715,723
174893,157715,658
174894,157715,429


## Recommendation System

### Recommendation for Top K Practice Areas based on Similarity for Specialties

#### Step 1: Build the specialty-attorney matrix

In [18]:
# Build the specialty-attorney matrix
specialty_attorney = data.groupby(['specialty_id','attorney_id'])['attorney_id'].count().unstack(fill_value=0)
specialty_attorney = (specialty_attorney > 0).astype(int)
specialty_attorney

specialty_id \ attorney_id,100000,100001,100002,100003,100005,100008,100010,100011,100012,100013,...,166168,166171,166173,166176,166177,166178,166179,166181,166183,166186
109,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
191,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
192,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
193,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
194,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
898,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
900,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
901,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
909,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
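
To make the shape of this matrix concrete, here is a minimal sketch on made-up data. The `toy` frame and its IDs are hypothetical, and `pd.crosstab` is used only as an alternative way to reach the same kind of binary matrix as the `groupby`/`unstack` step above.

```python
import pandas as pd

# Hypothetical toy data in the same long format as the dataset
toy = pd.DataFrame({
    "attorney_id": [1, 1, 2, 2, 3],
    "specialty_id": [10, 20, 10, 30, 20],
})

# Rows are specialties, columns are attorneys.
# clip(upper=1) keeps the matrix binary even if a pair appeared twice.
toy_matrix = pd.crosstab(toy["specialty_id"], toy["attorney_id"]).clip(upper=1)
print(toy_matrix)
```

Each cell is 1 when the attorney lists that specialty and 0 otherwise, so a row describes a specialty by the set of attorneys who practice it.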


#### Step 2: Build specialty-specialty similarity matrix

In [23]:
# Build the specialty-specialty cosine similarity matrix:
# L2-normalize each specialty row, then take pairwise dot products
specialty_attorney_norm = normalize(specialty_attorney, axis=1)
similarity = np.dot(specialty_attorney_norm, specialty_attorney_norm.T)
df_similarity = pd.DataFrame(similarity, index=specialty_attorney.index, columns=specialty_attorney.index)

df_similarity

specialty_id,109,191,192,193,194,196,197,198,199,200,...,892,894,895,896,897,898,900,901,909,910
109,1.000000,0.006421,0.000000,0.010703,0.000000,0.000000,0.000000,0.023315,0.009182,0.245576,...,0.176335,0.000000,0.063357,0.005659,0.000000,0.009548,0.005326,0.030527,0.000000,0.000000
191,0.006421,1.000000,0.016280,0.018555,0.014162,0.014747,0.011247,0.016168,0.055182,0.020274,...,0.018771,0.014539,0.032952,0.008175,0.022923,0.079083,0.050781,0.045362,0.008045,0.021128
192,0.000000,0.016280,1.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.009312,0.000000,...,0.000000,0.034794,0.006426,0.000000,0.000000,0.000000,0.000000,0.000000,0.023531,0.000000
193,0.010703,0.018555,0.000000,1.000000,0.020656,0.018435,0.034370,0.022458,0.038030,0.019712,...,0.008940,0.024234,0.018308,0.008176,0.010917,0.015328,0.014107,0.014702,0.004470,0.000000
194,0.000000,0.014162,0.000000,0.020656,1.000000,0.021107,0.010732,0.025712,0.031390,0.003224,...,0.005118,0.017657,0.025154,0.003120,0.006250,0.015794,0.033771,0.024047,0.010235,0.005040
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
898,0.009548,0.079083,0.000000,0.015328,0.015794,0.021928,0.024157,0.018699,0.062592,0.023447,...,0.030570,0.018343,0.047183,0.012966,0.008116,1.000000,0.038898,0.042469,0.005317,0.018326
900,0.005326,0.050781,0.000000,0.014107,0.033771,0.016818,0.013993,0.024585,0.070414,0.012611,...,0.021130,0.029599,0.036441,0.013561,0.002716,0.038898,1.000000,0.038147,0.093416,0.017524
901,0.030527,0.045362,0.000000,0.014702,0.024047,0.023787,0.014004,0.053075,0.073154,0.053930,...,0.081047,0.026033,0.074600,0.018878,0.016681,0.042469,0.038147,1.000000,0.012749,0.016143
909,0.000000,0.008045,0.023531,0.004470,0.010235,0.000000,0.000000,0.003895,0.013804,0.004884,...,0.000000,0.011462,0.008467,0.004727,0.004733,0.005317,0.093416,0.012749,1.000000,0.000000
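
Normalizing each row to unit length and then taking dot products yields the cosine similarity between specialties. For binary rows this reduces to the number of shared attorneys divided by the square root of the product of the two specialties' attorney counts. The sketch below checks that on a hypothetical 3x4 matrix, using plain NumPy in place of `normalize` (which applies the same L2 scaling by default).

```python
import numpy as np

# Hypothetical binary matrix: 3 specialties (rows) x 4 attorneys (columns)
toy = np.array([
    [1, 1, 0, 0],   # specialty A: attorneys 0 and 1
    [1, 1, 1, 0],   # specialty B: attorneys 0, 1 and 2
    [0, 0, 0, 1],   # specialty C: attorney 3 only
], dtype=float)

# Scale each row to unit L2 length, then take pairwise dot products
toy_norm = toy / np.linalg.norm(toy, axis=1, keepdims=True)
toy_sim = toy_norm @ toy_norm.T

# Same value from the definition: shared attorneys / sqrt(|A| * |B|)
manual_ab = 2 / np.sqrt(2 * 3)
print(toy_sim[0, 1], manual_ab)
```

Specialties with no attorneys in common (A and C here) get a similarity of 0, and every specialty has similarity 1 with itself.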


#### Step 3: Find the Top K most similar specialties

In [53]:
# Find the top k most similar specialties
def topk_specialty(specialty, similarity, k):
    # Drop the query specialty by label rather than assuming it sorts first
    scores = similarity.loc[specialty].drop(specialty)
    result = scores.sort_values(ascending=False).head(k).reset_index()
    result.columns = ['Specialty_Recommend', 'Similarity']
    return result

### Testing the Recommender System Based on Similarity
#### Process:
1. Ask the user to enter the ID of one of their current specialties.
2. The system recommends the 5 practice areas most similar to that specialty.

In [54]:
# Test on a specialty sample 1
user_input1 = int(input('Please input your specialty ID: '))
recommend_user1 = topk_specialty(specialty=user_input1, similarity=df_similarity, k=5)
print('Top 5 recommended practice areas for user 1:')
print('--------------------------------------------')
print(recommend_user1)

Please input your specialty ID: 909
Top 5 recommended practice areas for user 1:
--------------------------------------------
   Specialty_Recommend  Similarity
0                  205    0.117307
1                  439    0.115609
2                  712    0.101321
3                  208    0.098208
4                  252    0.097700


In [58]:
# Test on a specialty sample 2
user_input2 = int(input('Please input your specialty ID: '))
recommend_user2 = topk_specialty(specialty=user_input2, similarity=df_similarity, k=5)
print('Top 5 recommended practice areas for user 2:')
print('--------------------------------------------')
print(recommend_user2)

Please input your specialty ID: 196
Top 5 recommended practice areas for user 2:
--------------------------------------------
   Specialty_Recommend  Similarity
0                  436    0.103643
1                  263    0.080463
2                  218    0.066211
3                  429    0.063597
4                  667    0.057946


### Popularity-based Recommendation - If the user requests a recommendation based on popularity

In [55]:
# Rank specialties by the number of distinct attorneys who list them
df_specialty_popular = data.groupby('specialty_id')['attorney_id'].nunique().sort_values(ascending=False)
df_specialty_popular

specialty_id
218    14780
258     9856
429     9608
257     8615
373     8245
       ...  
430       38
679       32
884       29
423       17
192       14
Name: attorney_id, Length: 134, dtype: int64

In [56]:
# Top 5 specialties based on popularity among attorneys
# (df_specialty_popular is a Series: index = specialty_id, values = attorney counts)
print('The 5 most popular specialties:')
print('--------------------------------')
print(df_specialty_popular.nlargest(5, keep='all'))

The 5 most popular specialties:
--------------------------------
specialty_id
218    14780
258     9856
429     9608
257     8615
373     8245
Name: attorney_id, dtype: int64
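
A popularity ranking is the same for every attorney, so it can suggest specialties the attorney already has. A minimal sketch of a filtered variant follows. The helper `popular_for_attorney` is hypothetical, and the toy Series reuses only the five counts printed above.

```python
import pandas as pd

def popular_for_attorney(owned, popularity, k=5):
    """Return the k most popular specialties the attorney does not already list."""
    # errors="ignore" skips owned IDs that are absent from the ranking
    return popularity.drop(labels=list(owned), errors="ignore").nlargest(k)

# Counts copied from the top of the popularity ranking above
toy_popularity = pd.Series(
    {218: 14780, 258: 9856, 429: 9608, 257: 8615, 373: 8245},
    name="attorney_id",
)
toy_top = popular_for_attorney({218, 429}, toy_popularity, k=2)
print(toy_top)
```

This kind of fallback could also serve attorneys whose specialties have too few co-occurrences for the similarity scores to be informative.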
