The Yelp Dataset is a resource for academic research, teaching, and learning. It provides real-world data on businesses, reviews, and user interactions. Key figures:

- Reviews: 6,990,280 reviews from users
- Businesses: 150,346 businesses
- Pictures: 200,100 pictures
- Metropolitan areas: 11
- Tips: 908,915 tips by 1,987,897 users
- Business attributes: more than 1.2 million attributes such as hours, parking availability, and ambiance
- Aggregated check-ins: check-in data over time for each of 131,930 businesses

https://paperswithcode.com/dataset/yelp

In [81]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os

In [82]:
os.getcwd()

'c:\\Users\\yunus\\Desktop\\MSc\\Digital Driven Business\\Subjects\\System Development for Marketing\\RecSys\\Datasets2\\Yelp'

In [83]:
os.chdir(r'C:\Users\yunus\Desktop\MSc\Digital Driven Business\Subjects\System Development for Marketing\RecSys\Datasets2\Yelp')

Loading the business, check-in and tip JSON files into the DataFrames `df1`, `df2` and `df3`

In [84]:
df1 = pd.read_json("yelp_academic_dataset_business.json", lines=True)

In [85]:
df2 = pd.read_json("yelp_academic_dataset_checkin.json", lines=True)

In [86]:
df3 = pd.read_json("yelp_academic_dataset_tip.json", lines=True)

## 1. Data visualization and Exploratory Data Analysis (EDA)

The `.shape` attribute of a DataFrame returns a tuple with the number of rows followed by the number of columns. We check it for `df1`, `df2` and `df3`.

In [87]:
# Check df1 shape
df1.shape

(150346, 14)

In [88]:
# Check df2 shape
df2.shape

(131930, 2)

In [89]:
# Check df3 shape
df3.shape

(908915, 5)

In [90]:
df1.columns

Index(['business_id', 'name', 'address', 'city', 'state', 'postal_code',
       'latitude', 'longitude', 'stars', 'review_count', 'is_open',
       'attributes', 'categories', 'hours'],
      dtype='object')

In [91]:
df2.columns

Index(['business_id', 'date'], dtype='object')

In [92]:
df3.columns

Index(['user_id', 'business_id', 'text', 'date', 'compliment_count'], dtype='object')

The remaining two JSON files, 'user' (~3.2 GB) and 'review' (~5.2 GB), are too large to load comfortably in one go, so we read them in chunks with the `chunksize` parameter.

In [93]:
#chunk_size = 200000
#chunks = pd.read_json("yelp_academic_dataset_user.json", lines=True, chunksize=chunk_size)

#for i, chunk in enumerate(chunks):
#    print(f"Processing Chunk {i+1}...")

Show the 1st chunk only; there is no need to go through the other chunks.

In [94]:
# chunk_size = 200000
# Create an iterator to read the JSON file in chunks of the specified size
# chunks = pd.read_json("yelp_academic_dataset_user.json", lines=True, chunksize=chunk_size)

# Retrieve the first chunk directly from the iterator
# first_chunk_df = next(chunks)

Use a loop to iterate through the chunks and stop at the 2nd one.

In [95]:
chunk_size = 150000
# Create an iterator to read the JSON file in chunks of the specified size
chunks = pd.read_json("yelp_academic_dataset_user.json", lines=True, chunksize=chunk_size)

# Initialize a variable to store the second chunk
second_chunk_df4 = None

# Loop through the chunks until the second one is reached
for i, chunk in enumerate(chunks, start=1):
    if i == 2:  # Check if it's the second chunk
        second_chunk_df4 = chunk
        break  # Exit the loop once the second chunk is found

In [96]:
chunk_size = 150000
# Create an iterator to read the JSON file in chunks of the specified size
chunks = pd.read_json("yelp_academic_dataset_review.json", lines=True, chunksize=chunk_size)

# Initialize a variable to store the second chunk
second_chunk_df5 = None

# Loop through the chunks until the second one is reached
for i, chunk in enumerate(chunks, start=1):
    if i == 2:  # Check if it's the second chunk
        second_chunk_df5 = chunk
        break  # Exit the loop once the second chunk is found
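The two loops above differ only in the file name, so they can be folded into one helper. The following is a minimal sketch, not the notebook's own code: `read_nth_chunk` is a hypothetical name, the tiny temporary file stands in for the Yelp files, and the `with` form assumes pandas 1.2 or later, where the chunked JSON reader can be used as a context manager.

```python
import itertools
import json
import os
import tempfile

import pandas as pd


def read_nth_chunk(path, n, chunk_size):
    """Return the n-th chunk (1-based) of a JSON Lines file, or None if the file has fewer chunks."""
    with pd.read_json(path, lines=True, chunksize=chunk_size) as reader:
        # islice lazily skips the first n-1 chunks; next() falls back to None
        return next(itertools.islice(reader, n - 1, None), None)


# Tiny stand-in file (hypothetical data) so the sketch runs without the Yelp files
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as tmp:
    for i in range(10):
        tmp.write(json.dumps({"user_id": f"u{i}", "review_count": i}) + "\n")
    demo_path = tmp.name

second_chunk = read_nth_chunk(demo_path, n=2, chunk_size=4)   # rows 4 to 7
missing_chunk = read_nth_chunk(demo_path, n=5, chunk_size=4)  # file only has 3 chunks
os.remove(demo_path)

print(second_chunk)
```

With the real data, the call would look like `read_nth_chunk("yelp_academic_dataset_user.json", n=2, chunk_size=150000)`.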

In [97]:
second_chunk_df4.columns

Index(['user_id', 'name', 'review_count', 'yelping_since', 'useful', 'funny',
       'cool', 'elite', 'friends', 'fans', 'average_stars', 'compliment_hot',
       'compliment_more', 'compliment_profile', 'compliment_cute',
       'compliment_list', 'compliment_note', 'compliment_plain',
       'compliment_cool', 'compliment_funny', 'compliment_writer',
       'compliment_photos'],
      dtype='object')

In [98]:
second_chunk_df4.shape

(150000, 22)

In [100]:
second_chunk_df5.shape

(150000, 9)

## Display each DataFrame

In [101]:
df1.head()

Unnamed: 0,business_id,name,address,city,state,postal_code,latitude,longitude,stars,review_count,is_open,attributes,categories,hours
0,Pns2l4eNsfO8kk83dixA6A,"Abby Rappoport, LAC, CMQ","1616 Chapala St, Ste 2",Santa Barbara,CA,93101,34.426679,-119.711197,5.0,7,0,{'ByAppointmentOnly': 'True'},"Doctors, Traditional Chinese Medicine, Naturop...",
1,mpf3x-BjTdTEA3yCZrAYPw,The UPS Store,87 Grasso Plaza Shopping Center,Affton,MO,63123,38.551126,-90.335695,3.0,15,1,{'BusinessAcceptsCreditCards': 'True'},"Shipping Centers, Local Services, Notaries, Ma...","{'Monday': '0:0-0:0', 'Tuesday': '8:0-18:30', ..."
2,tUFrWirKiKi_TAnsVWINQQ,Target,5255 E Broadway Blvd,Tucson,AZ,85711,32.223236,-110.880452,3.5,22,0,"{'BikeParking': 'True', 'BusinessAcceptsCredit...","Department Stores, Shopping, Fashion, Home & G...","{'Monday': '8:0-22:0', 'Tuesday': '8:0-22:0', ..."
3,MTSW4McQd7CbVtyjqoe9mw,St Honore Pastries,935 Race St,Philadelphia,PA,19107,39.955505,-75.155564,4.0,80,1,"{'RestaurantsDelivery': 'False', 'OutdoorSeati...","Restaurants, Food, Bubble Tea, Coffee & Tea, B...","{'Monday': '7:0-20:0', 'Tuesday': '7:0-20:0', ..."
4,mWMc6_wTdE0EUBKIGXDVfA,Perkiomen Valley Brewery,101 Walnut St,Green Lane,PA,18054,40.338183,-75.471659,4.5,13,1,"{'BusinessAcceptsCreditCards': 'True', 'Wheelc...","Brewpubs, Breweries, Food","{'Wednesday': '14:0-22:0', 'Thursday': '16:0-2..."


- business_id: The unique identifier for the business
- name: The name of the business
- address: The street address of the business
- city: The city where the business is located
- state: The state in which the business is situated
- postal_code: The postal code for the location
- latitude: The latitude coordinate of the business location
- longitude: The longitude coordinate of the business location
- stars: The average rating of the business, on a scale of 1 to 5 stars
- review_count: The number of reviews the business has received
- is_open: Indicates whether the business is currently open (1) or closed (0)
- attributes: Any additional attributes of the business
- categories: The categories the business falls under
- hours: The operating hours of the business

In [117]:
nunique_df1_categories = df1['categories'].nunique()
nunique_df1_categories

83160

In [126]:
# Count how often each full 'categories' string occurs (most frequent first)
df1_category_counts = df1['categories'].value_counts()
df1_category_counts

categories
Beauty & Spas, Nail Salons                                                                                       1012
Restaurants, Pizza                                                                                                935
Nail Salons, Beauty & Spas                                                                                        934
Pizza, Restaurants                                                                                                823
Restaurants, Mexican                                                                                              728
                                                                                                                 ... 
Dermatologists, Health & Medical, Cosmetic Surgeons, Doctors, Acne Treatment, Skin Care, Beauty & Spas              1
Home Services, Home & Garden, Nurseries & Gardening, Hardware Stores, Shopping, Building Supplies, Appliances       1
Food Trucks, Smokehouse, Restaurants, Food, B
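The counts above treat each full `categories` string as one value, which is why "Beauty & Spas, Nail Salons" and "Nail Salons, Beauty & Spas" appear as separate entries. One possible way to count individual categories instead is to split the strings and explode them into one row per category. This is a sketch on a toy Series with hypothetical rows in the same comma-separated format, not a result computed on the Yelp data.

```python
import pandas as pd

# Toy stand-in for df1['categories'] (hypothetical rows)
categories = pd.Series([
    "Beauty & Spas, Nail Salons",
    "Nail Salons, Beauty & Spas",
    "Restaurants, Pizza",
    None,  # some businesses have no categories
])

individual = (
    categories.dropna()   # drop businesses without categories
    .str.split(",")       # "A, B" -> ["A", " B"]
    .explode()            # one row per category
    .str.strip()          # remove the leftover spaces
)
category_counts = individual.value_counts()
print(category_counts)
```

On the real data the same chain would be applied to `df1['categories']`.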

In [103]:
df2.head()

Unnamed: 0,business_id,date
0,---kPU91CF4Lq2-WlRu9Lw,"2020-03-13 21:10:56, 2020-06-02 22:18:06, 2020..."
1,--0iUa4sNDFiZFrAdIWhZQ,"2010-09-13 21:43:09, 2011-05-04 23:08:15, 2011..."
2,--30_8IhuyMHbSOcNWd6DQ,"2013-06-14 23:29:17, 2014-08-13 23:20:22"
3,--7PUidqRWpRSpXebiyxTg,"2011-02-15 17:12:00, 2011-07-28 02:46:10, 2012..."
4,--7jw19RH9JKXgFohspgQw,"2014-04-21 20:42:11, 2014-04-28 21:04:46, 2014..."


- business_id: The unique identifier for the business
- date: A comma-separated string of check-in timestamps for the business
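Because every check-in for a business is packed into one string, the `date` column has to be split before it can be counted or plotted. A minimal sketch, assuming the ", " separator and timestamp layout seen in the `df2.head()` output; the frame below is hypothetical toy data, and `long_checkins` and `checkins_per_business` are illustrative names.

```python
import pandas as pd

# Toy stand-in for df2 (hypothetical IDs and timestamps)
checkins = pd.DataFrame({
    "business_id": ["b1", "b2"],
    "date": [
        "2020-01-01 10:00:00, 2020-01-02 11:30:00, 2020-01-03 12:45:00",
        "2019-05-05 08:15:00, 2019-06-06 09:20:00",
    ],
})

# Split each string into a list, then produce one row per check-in
long_checkins = (
    checkins.assign(date=checkins["date"].str.split(", "))
    .explode("date", ignore_index=True)
)
long_checkins["date"] = pd.to_datetime(long_checkins["date"], format="%Y-%m-%d %H:%M:%S")

# Number of check-ins per business
checkins_per_business = long_checkins.groupby("business_id").size()
print(checkins_per_business)
```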

In [104]:
df3.head()

Unnamed: 0,user_id,business_id,text,date,compliment_count
0,AGNUgVwnZUey3gcPCJ76iw,3uLgwr0qeCNMjKenHJwPGQ,Avengers time with the ladies.,2012-05-18 02:17:21,0
1,NBN4MgHP9D3cw--SnauTkA,QoezRbYQncpRqyrLH6Iqjg,They have lots of good deserts and tasty cuban...,2013-02-05 18:35:10,0
2,-copOvldyKh1qr-vzkDEvw,MYoRNLb5chwjQe3c_k37Gg,It's open even when you think it isn't,2013-08-18 00:56:08,0
3,FjMQVZjSqY8syIO-53KFKw,hV-bABTK-glh5wj31ps_Jw,Very decent fried chicken,2017-06-27 23:05:38,0
4,ld0AperBXk1h6UbqmM80zw,_uN0OudeJ3Zl_tf6nxg5ww,Appetizers.. platter special for lunch,2012-10-06 19:43:09,0


- user_id: The unique identifier for the user
- business_id: The unique identifier for the business
- text: The text of the tip left by the user
- date: The timestamp when the tip was written
- compliment_count: The number of compliments the tip has received from other users

`second_chunk_df4` and `second_chunk_df5` are each a single chunk read with `chunk_size = 150000`, so both DataFrames have 150,000 rows (see their `.shape` output above). Their index starts at 150000 because each is the second chunk of its file.

In [105]:
second_chunk_df4.head()

Unnamed: 0,user_id,name,review_count,yelping_since,useful,funny,cool,elite,friends,fans,...,compliment_more,compliment_profile,compliment_cute,compliment_list,compliment_note,compliment_plain,compliment_cool,compliment_funny,compliment_writer,compliment_photos
150000,2iWMCiIIH1TTu_3FTy1nzg,Rick,11,2013-04-09 15:52:01,32,3,1,,,0,...,0,0,0,0,0,0,0,0,0,0
150001,R7SN0aR-fyOkaHIYHZlbUQ,Kathy,13,2013-04-02 22:07:24,16,1,1,,,0,...,0,0,0,0,0,0,0,0,0,0
150002,UH4UvNqQp7G7NCwWEEQ2QA,Allison,9,2012-12-14 05:30:26,3,0,0,,,0,...,0,0,0,0,0,0,0,0,0,0
150003,0oGjOQzajgn9dXaXSyvLqQ,Paula,1,2015-06-14 16:04:25,0,0,0,,,0,...,0,0,0,0,0,0,0,0,0,0
150004,QfTWh_GVsTGF0tLFQCeNCg,Joanna,14,2014-07-20 20:47:32,9,3,2,,,0,...,0,0,0,0,0,0,1,1,0,0


- user_id: The unique identifier assigned to the user 
- name: The name of the user
- review_count: The total number of reviews the user has written
- yelping_since: The date when the user joined or started yelping
- useful: The number of times the user's reviews have been marked as useful by others
- funny: The number of times the user's reviews have been marked as funny
- cool: The number of times the user's reviews have been marked as cool
- elite: Indicates the years the user was recognized as 'elite' on the platform
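- friends: The user's friends on the platform, given as user IDs
- fans: The number of users who are fans of this particular user
- average_stars: The average star rating across the user's reviews
- compliment_hot: The number of 'hot' compliments the user has received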
- compliment_more: Additional compliments not specified by other categories.
- compliment_profile: Compliments received for the user's profile.
- compliment_cute: Compliments received for being cute.
- compliment_list: Compliments received for lists the user has created.
- compliment_note: Compliments received for notes or messages left by the user.
- compliment_plain: Compliments received for plain text reviews.
- compliment_cool: Compliments received for cool reviews.
- compliment_funny: Compliments received for funny reviews.
- compliment_writer: Compliments received for the user's writing ability.
- compliment_photos: Compliments received for photos the user has posted

In [106]:
second_chunk_df5.head()

Unnamed: 0,review_id,user_id,business_id,stars,useful,funny,cool,text,date
150000,EGur4Zepuqb0EHiI9PsMuQ,5l83SrNyEylo5AhHM2dv9Q,vUrTGX_7HxqeoQ_6QCVz6g,5,0,0,0,Awesome for Philly to have such a delicious an...,2018-01-28 20:32:57
150001,pVrDxVf2gr4i90DcpxSu7A,wXdbkFZsfDR7utJvbWElyA,XX2PSfT4xuHq0yuoPIge1A,5,0,0,0,This is the third of the 5 local (soon to be 6...,2018-03-03 11:15:52
150002,tkrD0V2SPAUi7BGBJ5cCZA,QWbqFXj_Tx_tc98AFTYkQA,CKgsvMxnFVoph6IluQevWg,5,0,0,0,I am from New York. The hand tossed pizza at M...,2016-02-08 04:33:32
150003,yRVOJluGeb5sBnCXN0K80A,w7oU_1aKc1rluXIOhtV2wA,tQ6VNQ9ezkxRE4bvu9WShQ,1,0,0,0,We were greeted fast. It went downhill from t...,2014-07-23 02:30:52
150004,8xz9wueeuC6rVgDorKoUOA,GXfHd3ZuJZ8E7YYwD0cyUg,JUlsvVAvZvGHWFfkKm0nlg,5,3,0,2,This restaurant is located right across from P...,2016-02-06 23:27:30


- review_id: The unique identifier assigned to the review
- user_id: The unique identifier for the user who wrote the review
- business_id: The unique identifier for the business being reviewed
- stars: The star rating given by the user for the business, on a scale of 1 to 5
- useful: The number of times this review has been marked as useful by other users
- funny: The number of times this review has been marked as funny by other users
- cool: The number of times this review has been marked as cool by other users
- text: The text of the review, providing the user's feedback on the business
- date: The date and time when the review was posted

## Merging

In [107]:
# Merge businesses (df1) with tips (df3) on 'business_id', using the first 150,000 rows of each
merged_df1_df3 = pd.merge(df1.iloc[:150000], df3.iloc[:150000], on='business_id')

In [129]:
merged_df1_df3.head(5)

Unnamed: 0,business_id,name,address,city,state,postal_code,latitude,longitude,stars,review_count,is_open,attributes,categories,hours,user_id,text,date,compliment_count
0,mpf3x-BjTdTEA3yCZrAYPw,The UPS Store,87 Grasso Plaza Shopping Center,Affton,MO,63123,38.551126,-90.335695,3.0,15,1,{'BusinessAcceptsCreditCards': 'True'},"Shipping Centers, Local Services, Notaries, Ma...","{'Monday': '0:0-0:0', 'Tuesday': '8:0-18:30', ...",trf3Qcz8qvCDKXiTgjUcEg,Dropping off my Amazon return.,2011-12-12 23:30:26,0
1,tUFrWirKiKi_TAnsVWINQQ,Target,5255 E Broadway Blvd,Tucson,AZ,85711,32.223236,-110.880452,3.5,22,0,"{'BikeParking': 'True', 'BusinessAcceptsCredit...","Department Stores, Shopping, Fashion, Home & G...","{'Monday': '8:0-22:0', 'Tuesday': '8:0-22:0', ...",_5swqa5xUdLar-Q-bBZSDA,Containers!,2012-03-29 18:47:55,0
2,tUFrWirKiKi_TAnsVWINQQ,Target,5255 E Broadway Blvd,Tucson,AZ,85711,32.223236,-110.880452,3.5,22,0,"{'BikeParking': 'True', 'BusinessAcceptsCredit...","Department Stores, Shopping, Fashion, Home & G...","{'Monday': '8:0-22:0', 'Tuesday': '8:0-22:0', ...",oAvO0BOHOagOI7WVGXlWSA,This place looks the same as other target at c...,2012-12-11 02:50:41,0
3,tUFrWirKiKi_TAnsVWINQQ,Target,5255 E Broadway Blvd,Tucson,AZ,85711,32.223236,-110.880452,3.5,22,0,"{'BikeParking': 'True', 'BusinessAcceptsCredit...","Department Stores, Shopping, Fashion, Home & G...","{'Monday': '8:0-22:0', 'Tuesday': '8:0-22:0', ...",moSLKqdFUI-B80vun67UfQ,"clean just stopped for some pens, not to busy ...",2014-09-21 23:01:02,0
4,MTSW4McQd7CbVtyjqoe9mw,St Honore Pastries,935 Race St,Philadelphia,PA,19107,39.955505,-75.155564,4.0,80,1,"{'RestaurantsDelivery': 'False', 'OutdoorSeati...","Restaurants, Food, Bubble Tea, Coffee & Tea, B...","{'Monday': '7:0-20:0', 'Tuesday': '7:0-20:0', ...",WqeE5e5ROfaVEgkb9dAkiQ,Love their pastries and drinks!,2017-09-20 17:00:27,0
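`pd.merge` defaults to an inner join, so tips whose business is not among the selected rows of `df1` are dropped silently, and so are businesses without tips. One way to see how many rows match is an outer merge with `indicator=True`; `validate="one_to_many"` additionally checks that `business_id` is unique on the business side. This is a sketch on hypothetical toy frames, not a result for the Yelp data.

```python
import pandas as pd

# Toy stand-ins for df1 and df3 (hypothetical rows)
businesses = pd.DataFrame({"business_id": ["b1", "b2", "b3"], "name": ["A", "B", "C"]})
tips = pd.DataFrame({"business_id": ["b1", "b1", "b4"], "text": ["t1", "t2", "t3"]})

checked = pd.merge(
    businesses,
    tips,
    on="business_id",
    how="outer",             # keep unmatched rows from both sides
    validate="one_to_many",  # raise if a business_id repeats in `businesses`
    indicator=True,          # adds a '_merge' column: both / left_only / right_only
)
match_counts = checked["_merge"].value_counts()
print(match_counts)

# The rows an inner join would have kept
inner = checked[checked["_merge"] == "both"].drop(columns="_merge")
```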


In [148]:
# Updated mapping dictionary with the "Fashion" category
text_category_keywords = {
    "Retail": [
        'store', 'shopping', 'department', 'merchandise', 'retail', 'shop', 'outlet', 'boutique', 'mall',
        'marketplace', 'goods', 'products'
    ],
    "Food and Beverages": [
        'restaurant', 'eatery', 'cafe', 'bakery', 'dining', 'food', 'drink', 'coffee', 'tea', 'bar', 'grill', 'diner',
        'bistro', 'kitchen', 'pastry', 'cuisine', 'meal', 'menu'
    ],
    "Services": [
        'service', 'shipping', 'notary', 'mailbox', 'repair', 'laundry', 'cleaning', 'maintenance', 'consulting',
        'legal', 'financial', 'insurance', 'healthcare', 'therapy', 'counseling'
    ],
    "Health and Beauty": [
        'spa', 'salon', 'gym', 'fitness', 'health', 'beauty', 'medical', 'wellness', 'clinic', 'cosmetic', 'hair', 'nail',
        'yoga', 'pilates', 'therapy', 'treatment', 'massage', 'skincare'
    ],
    "Entertainment and Leisure": [
        'movie', 'theater', 'park', 'museum', 'entertainment', 'leisure', 'recreation', 'amusement', 'concert', 'gallery',
        'exhibit', 'festival', 'event', 'attraction', 'sightseeing'
    ],
    "Automotive": [
        'auto', 'car', 'vehicle', 'repair', 'dealership', 'mechanic', 'garage', 'maintenance', 'parts', 'service', 'oil',
        'tire', 'detailing', 'bodywork', 'wash'
    ],
    "Education": [
        'school', 'education', 'tutoring', 'learning', 'class', 'course', 'university', 'college', 'institute', 'seminar',
        'workshop', 'training', 'academy', 'study', 'scholarship'
    ],
    "Fashion": [
        'fashion', 'apparel', 'clothing', 'accessories', 'style', 'boutique', 'designer', 'footwear', 'shoes', 'jewelry',
        'trend', 'outfit', 'wardrobe', 'collection', 'wear'
    ]
}

# Map a text to every broad category that has at least one keyword occurring in it
def map_text_to_category(text):
    lower_text = text.lower()  # Convert text to lowercase for case-insensitive matching
    assigned_categories = []
    for broad_category, keywords in text_category_keywords.items():
        if any(keyword in lower_text for keyword in keywords):
            assigned_categories.append(broad_category)
    
    # Handle cases where the text does not clearly fit into any predefined category
    if not assigned_categories:
        assigned_categories.append("Other")
    
    return ', '.join(assigned_categories)

# Apply the mapping function to the 'text' column of merged_df1_df3
merged_df1_df3['Broad Category'] = merged_df1_df3['text'].apply(map_text_to_category)

# Display the first few rows to verify the new column
merged_df1_df3.head()

Unnamed: 0,business_id,name,address,city,state,postal_code,latitude,longitude,stars,review_count,is_open,attributes,categories,hours,user_id,text,date,compliment_count,Broad Category,Combined Info
0,mpf3x-BjTdTEA3yCZrAYPw,The UPS Store,87 Grasso Plaza Shopping Center,Affton,MO,63123,38.551126,-90.335695,3.0,15,1,{'BusinessAcceptsCreditCards': 'True'},"Shipping Centers, Local Services, Notaries, Ma...","{'Monday': '0:0-0:0', 'Tuesday': '8:0-18:30', ...",trf3Qcz8qvCDKXiTgjUcEg,Dropping off my Amazon return.,2011-12-12 23:30:26,0,Other,"Shipping Centers, Local Services, Notaries, Ma..."
1,tUFrWirKiKi_TAnsVWINQQ,Target,5255 E Broadway Blvd,Tucson,AZ,85711,32.223236,-110.880452,3.5,22,0,"{'BikeParking': 'True', 'BusinessAcceptsCredit...","Department Stores, Shopping, Fashion, Home & G...","{'Monday': '8:0-22:0', 'Tuesday': '8:0-22:0', ...",_5swqa5xUdLar-Q-bBZSDA,Containers!,2012-03-29 18:47:55,0,Other,"Department Stores, Shopping, Fashion, Home & G..."
2,tUFrWirKiKi_TAnsVWINQQ,Target,5255 E Broadway Blvd,Tucson,AZ,85711,32.223236,-110.880452,3.5,22,0,"{'BikeParking': 'True', 'BusinessAcceptsCredit...","Department Stores, Shopping, Fashion, Home & G...","{'Monday': '8:0-22:0', 'Tuesday': '8:0-22:0', ...",oAvO0BOHOagOI7WVGXlWSA,This place looks the same as other target at c...,2012-12-11 02:50:41,0,"Retail, Food and Beverages","Department Stores, Shopping, Fashion, Home & G..."
3,tUFrWirKiKi_TAnsVWINQQ,Target,5255 E Broadway Blvd,Tucson,AZ,85711,32.223236,-110.880452,3.5,22,0,"{'BikeParking': 'True', 'BusinessAcceptsCredit...","Department Stores, Shopping, Fashion, Home & G...","{'Monday': '8:0-22:0', 'Tuesday': '8:0-22:0', ...",moSLKqdFUI-B80vun67UfQ,"clean just stopped for some pens, not to busy ...",2014-09-21 23:01:02,0,Other,"Department Stores, Shopping, Fashion, Home & G..."
4,MTSW4McQd7CbVtyjqoe9mw,St Honore Pastries,935 Race St,Philadelphia,PA,19107,39.955505,-75.155564,4.0,80,1,"{'RestaurantsDelivery': 'False', 'OutdoorSeati...","Restaurants, Food, Bubble Tea, Coffee & Tea, B...","{'Monday': '7:0-20:0', 'Tuesday': '7:0-20:0', ...",WqeE5e5ROfaVEgkb9dAkiQ,Love their pastries and drinks!,2017-09-20 17:00:27,0,Food and Beverages,"Restaurants, Food, Bubble Tea, Coffee & Tea, B..."
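`map_text_to_category` uses substring matching (`keyword in lower_text`), so short keywords also fire inside longer words: 'bar' matches "barber" and 'car' matches "care". A possible alternative is whole-word matching with regular-expression word boundaries. The sketch below uses an abbreviated, hypothetical keyword dictionary, and `map_text_whole_words` is an illustrative name, not part of the notebook.

```python
import re

# Abbreviated, hypothetical subset of text_category_keywords
keywords_by_category = {
    "Food and Beverages": ["bar", "tea", "food"],
    "Automotive": ["car", "oil"],
}

# One compiled pattern per category; \b restricts matches to whole words
patterns = {
    category: re.compile(r"\b(?:" + "|".join(map(re.escape, words)) + r")\b")
    for category, words in keywords_by_category.items()
}


def map_text_substring(text):
    """Substring matching, as in the notebook's map_text_to_category."""
    lower_text = text.lower()
    matched = [c for c, words in keywords_by_category.items()
               if any(w in lower_text for w in words)]
    return ", ".join(matched) if matched else "Other"


def map_text_whole_words(text):
    """Whole-word matching using the compiled patterns."""
    lower_text = text.lower()
    matched = [c for c, pattern in patterns.items() if pattern.search(lower_text)]
    return ", ".join(matched) if matched else "Other"


sample = "Great barber, he takes real care"
print(map_text_substring(sample))
print(map_text_whole_words(sample))
```

The trade-off is that whole-word matching misses inflected forms such as "bars" unless they are listed as keywords too.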


In [149]:
profile_merged_df1_df3 = merged_df1_df3['Broad Category'].value_counts()
profile_merged_df1_df3

Broad Category
Other                                                                                   98515
Food and Beverages                                                                      23346
Services, Automotive                                                                     4451
Food and Beverages, Services, Automotive                                                 3847
Automotive                                                                               3788
                                                                                        ...  
Services, Health and Beauty, Automotive, Education                                          1
Food and Beverages, Health and Beauty, Automotive, Fashion                                  1
Services, Entertainment and Leisure, Automotive, Education                                  1
Retail, Services, Entertainment and Leisure                                                 1
Retail, Food and Beverages, Health and Beauty

In [108]:
# Merge 'merged_df1_df3' with the review chunk 'second_chunk_df5' on 'business_id', limiting the chunk to its first 150,000 rows
merged_df1_df3_df5 = pd.merge(merged_df1_df3, second_chunk_df5.iloc[:150000], on='business_id')

In [109]:
merged_df1_df3_df5.columns

Index(['business_id', 'name', 'address', 'city', 'state', 'postal_code',
       'latitude', 'longitude', 'stars_x', 'review_count', 'is_open',
       'attributes', 'categories', 'hours', 'user_id_x', 'text_x', 'date_x',
       'compliment_count', 'review_id', 'user_id_y', 'stars_y', 'useful',
       'funny', 'cool', 'text_y', 'date_y'],
      dtype='object')

In [110]:
second_chunk_df4.columns

Index(['user_id', 'name', 'review_count', 'yelping_since', 'useful', 'funny',
       'cool', 'elite', 'friends', 'fans', 'average_stars', 'compliment_hot',
       'compliment_more', 'compliment_profile', 'compliment_cute',
       'compliment_list', 'compliment_note', 'compliment_plain',
       'compliment_cool', 'compliment_funny', 'compliment_writer',
       'compliment_photos'],
      dtype='object')

In [111]:
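# Note: this renames columns of merged_df1_df3, not of the merged_df1_df3_df5 created above,
# so the review columns from the previous merge are discarded. merged_df1_df3 has no
# 'user_id_y' column, so the rename itself changes nothing.
merged_df1_df3_df5 = merged_df1_df3.rename(columns={'user_id_y': 'user_id'})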

In [112]:
merged_df1_df3_df5.head(1)

Unnamed: 0,business_id,name,address,city,state,postal_code,latitude,longitude,stars,review_count,is_open,attributes,categories,hours,user_id,text,date,compliment_count
0,mpf3x-BjTdTEA3yCZrAYPw,The UPS Store,87 Grasso Plaza Shopping Center,Affton,MO,63123,38.551126,-90.335695,3.0,15,1,{'BusinessAcceptsCreditCards': 'True'},"Shipping Centers, Local Services, Notaries, Ma...","{'Monday': '0:0-0:0', 'Tuesday': '8:0-18:30', ...",trf3Qcz8qvCDKXiTgjUcEg,Dropping off my Amazon return.,2011-12-12 23:30:26,0


In [113]:
# Merge 'merged_df1_df3_df5' with the user chunk 'second_chunk_df4' on 'user_id', limiting the chunk to its first 150,000 rows
final_merged_df = pd.merge(merged_df1_df3_df5, second_chunk_df4.iloc[:150000], on='user_id')

In [114]:
final_merged_df.columns

Index(['business_id', 'name_x', 'address', 'city', 'state', 'postal_code',
       'latitude', 'longitude', 'stars', 'review_count_x', 'is_open',
       'attributes', 'categories', 'hours', 'user_id', 'text', 'date',
       'compliment_count', 'name_y', 'review_count_y', 'yelping_since',
       'useful', 'funny', 'cool', 'elite', 'friends', 'fans', 'average_stars',
       'compliment_hot', 'compliment_more', 'compliment_profile',
       'compliment_cute', 'compliment_list', 'compliment_note',
       'compliment_plain', 'compliment_cool', 'compliment_funny',
       'compliment_writer', 'compliment_photos'],
      dtype='object')
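The column listing above contains no review columns (`review_id`, `stars_y`, `text_y`), because the frame passed to the last merge was rebuilt from `merged_df1_df3`. If the intent is to keep the reviews and join users on the review author, one option is to give the overlapping columns explicit suffixes and rename the review's user column before the final merge. This is a sketch under that assumption, on hypothetical toy frames with illustrative names.

```python
import pandas as pd

# Toy stand-ins (hypothetical rows) for merged_df1_df3, second_chunk_df5 and second_chunk_df4
tips_with_business = pd.DataFrame({
    "business_id": ["b1", "b2"],
    "user_id": ["u1", "u2"],
    "text": ["tip one", "tip two"],
})
reviews = pd.DataFrame({
    "business_id": ["b1", "b1"],
    "user_id": ["u3", "u4"],
    "stars": [5, 1],
    "text": ["review one", "review two"],
})
users = pd.DataFrame({"user_id": ["u3", "u5"], "fans": [7, 0]})

# Step 1: attach reviews; explicit suffixes replace the default _x / _y
with_reviews = pd.merge(
    tips_with_business, reviews, on="business_id", suffixes=("_tip", "_review")
)

# Step 2: rename on the merged frame itself, then attach the review author's profile
with_users = pd.merge(
    with_reviews.rename(columns={"user_id_review": "user_id"}),
    users,
    on="user_id",
)
print(with_users.columns.tolist())
```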