In [1]:
import pandas as pd

# Merging DataFrames Notebook

For the CareerVillage recommender system, we were given 15 CSVs to work with. To make later processing easier and to surface significant relationships, this notebook merges the relevant DataFrames. After opening and exploring the DataFrames, I realized that merging them wouldn't be as simple as I had hoped: the DataFrames don't line up well, and a merge either loses information or creates rows that don't reflect real data. I first did an initial pass at merging the DataFrames so that I could dig further into the data, clean it, and run EDA. After cleaning and EDA, I adjusted the merges to retain as much information as possible. With more time, it would be worth fitting the datasets together more carefully to minimize both lost and made-up information.

For general cleaning, EDA, and modeling I merged:
* answers.csv
* answer_scores.csv
* question_scores.csv
* questions.csv
* tag_questions.csv
* tags.csv
* professionals.csv

## <span style = 'color:red'> include image
    
To do this, I first merged the answers CSVs together, then the questions CSVs, then the tag CSVs. Once I had three merged DataFrames, I merged those together.

## <span style = 'color:red'> explain how


To further explore how tags relate to whether questions get answered (and to avoid making up or losing information), I merged:
## <span style = 'color:red'> List
    
The data came with only a sparse data dictionary: a short note about what each file was. There was no explicit explanation of what specific columns meant or which columns were compatible across files. After exploring a few of the files, I assumed that the ID columns (for example, `id` in answer_scores.csv and `answers_id` in answers.csv) referred to the same entities, and merged on those columns.
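Because the key columns are assumed rather than documented, it can help to measure how well two candidate keys overlap before merging on them. The following is a minimal sketch on invented toy data; `key_overlap` and the `toy_*` frames are hypothetical names, not part of the dataset or of pandas.

```python
import pandas as pd

def key_overlap(left: pd.DataFrame, left_col: str,
                right: pd.DataFrame, right_col: str) -> float:
    """Share of distinct non-null keys in left[left_col] that also occur in right[right_col]."""
    left_keys = set(left[left_col].dropna().unique())
    right_keys = set(right[right_col].dropna().unique())
    if not left_keys:
        return 0.0
    return len(left_keys & right_keys) / len(left_keys)

# Toy stand-ins for answers.csv and answer_scores.csv (values are invented)
toy_answers = pd.DataFrame({"answers_id": ["a1", "a2", "a3", "a4"]})
toy_scores = pd.DataFrame({"id": ["a1", "a2", "a3", "zz"], "score": [1, 5, 2, 0]})

overlap = key_overlap(toy_answers, "answers_id", toy_scores, "id")
print(f"{overlap:.0%} of answers_id values appear in the score table")
```

An overlap close to 100% supports the assumption that the two columns refer to the same entities; a low value suggests the columns are not compatible.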

#### CSVs Used:

In [2]:
# Answers
answers_df = pd.read_csv('./Datasets/answers.csv')
answer_score_df = pd.read_csv('./Datasets/answer_scores.csv')

# Questions
q_score_df = pd.read_csv('./Datasets/question_scores.csv')
ques_df = pd.read_csv('./Datasets/questions.csv')

# Tags
tag_q_df = pd.read_csv('./Datasets/tag_questions.csv')
tags_df = pd.read_csv('./Datasets/tags.csv')

#Professional
pros_df = pd.read_csv('./Datasets/professionals.csv')

#### Unused CSVs

Given limited time, I only worked with 7 of the 15 CSVs. I didn't include tag_users.csv because it wasn't compatible with how I merged the other files; merging it would have created rows that weren't necessarily true. The remaining DataFrames weren't used because they weren't important for the specific problem I was trying to solve and because merging them would skew the information.

In [3]:
#Unused Tags
tag_users_df = pd.read_csv('./Datasets/tag_users.csv')

# Others
coms_df = pd.read_csv('./Datasets/comments.csv')
emails_df = pd.read_csv('./Datasets/emails.csv')
group_mem_df = pd.read_csv('./Datasets/group_memberships.csv')
groups_df = pd.read_csv('./Datasets/groups.csv')
school_mem_df = pd.read_csv('./Datasets/school_memberships.csv')
students_df = pd.read_csv('./Datasets/students.csv')
matches_df = pd.read_csv('./Datasets/matches.csv')

#### Throughout this notebook you'll notice I do the same thing for each merge
1. View the DataFrame's first few rows.
2. Check the shape of the DataFrame (to ensure it merged as expected).
3. Rename columns so that the key column names match when merging. *Again, I had to assume these columns were compatible based on examining the data.*
4. Rename columns whose names would be confusing once merged (e.g., there's a `score` column in both the questions and answers DataFrames).
5. Merge. Every merge is a left merge on a column shared by both DataFrames.

* At the end of the notebook, I export the merged DataFrame as a CSV to the Datasets folder.
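The recurring steps above can be sketched as a small helper. This is only an illustration on invented toy data; `left_merge_on` and the `toy_*` frames are hypothetical names and do not appear elsewhere in the notebook.

```python
import pandas as pd

def left_merge_on(left, right, key, rename=None):
    """Rename columns in `right`, left-merge it onto `left`, and print the shapes."""
    right = right.rename(columns=rename or {})
    merged = left.merge(right, on=key, how="left")
    print("left:", left.shape, "right:", right.shape, "merged:", merged.shape)
    return merged

# Toy stand-ins for questions.csv and question_scores.csv (values are invented)
toy_left = pd.DataFrame({"questions_id": ["q1", "q2", "q3"],
                         "questions_title": ["a", "b", "c"]})
toy_right = pd.DataFrame({"id": ["q1", "q2"], "score": [5, 4]})

toy_merged = left_merge_on(
    toy_left, toy_right, key="questions_id",
    rename={"id": "questions_id", "score": "questions_score"},
)
```

A left merge keeps every row of the left DataFrame; keys with no match on the right (here `q3`) get `NaN` in the columns that come from the right.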

<span style ='color:red'> pic of left merge

## Merging Dataframes with Answers Information

#### `answers_df`

In [4]:
print("shape:", answers_df.shape)
answers_df.head(2)

shape: (51123, 5)


Unnamed: 0,answers_id,answers_author_id,answers_question_id,answers_date_added,answers_body
0,4e5f01128cae4f6d8fd697cec5dca60c,36ff3b3666df400f956f8335cf53e09e,332a511f1569444485cf7a7a556a5e54,2016-04-29 19:40:14 UTC+0000,<p>Hi!</p>\n<p>You are asking a very interesti...
1,ada720538c014e9b8a6dceed09385ee3,2aa47af241bf42a4b874c453f0381bd4,eb80205482e4424cad8f16bc25aa2d9c,2018-05-01 14:19:08 UTC+0000,<p>Hi. I joined the Army after I attended coll...


#### `answer_score_df`

In [5]:
print("shape:", answer_score_df.shape)
answer_score_df.head()

shape: (51138, 2)


Unnamed: 0,id,score
0,7b2bb0fc0d384e298cffa6afde9cf6ab,1
1,7640a6e5d5224c8681cc58de860858f4,5
2,3ce32e236fa9435183b2180fb213375c,2
3,fa30fe4c016043e382c441a7ef743bfb,0
4,71229eb293314c8a9e545057ecc32c93,2


#### Renaming `id` to `answers_id` to match the answers DataFrame for merging

In [6]:
answer_score_df.rename(columns={'id': 'answers_id'}, inplace=True)
answer_score_df.head(1)

Unnamed: 0,answers_id,score
0,7b2bb0fc0d384e298cffa6afde9cf6ab,1


#### Merging the `answers_df` and `answer_score_df` DataFrames on `answers_id` with `how='left'`, saving the result as `df_answer`

In [7]:
# Merge two Dataframes on single column 'answers_id'
df_answer = answers_df.merge(answer_score_df, on='answers_id', how='left')
print("shape:", df_answer.shape)
df_answer.head(2)

shape: (51123, 6)


Unnamed: 0,answers_id,answers_author_id,answers_question_id,answers_date_added,answers_body,score
0,4e5f01128cae4f6d8fd697cec5dca60c,36ff3b3666df400f956f8335cf53e09e,332a511f1569444485cf7a7a556a5e54,2016-04-29 19:40:14 UTC+0000,<p>Hi!</p>\n<p>You are asking a very interesti...,0.0
1,ada720538c014e9b8a6dceed09385ee3,2aa47af241bf42a4b874c453f0381bd4,eb80205482e4424cad8f16bc25aa2d9c,2018-05-01 14:19:08 UTC+0000,<p>Hi. I joined the Army after I attended coll...,0.0
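The two inputs have slightly different row counts (51123 answers versus 51138 scores), so some keys presumably exist on only one side. One way to count them is an outer merge with `indicator=True`, sketched here on invented toy data (the `toy_*` frames and `audit` are hypothetical names).

```python
import pandas as pd

# Toy stand-ins for answers_df and answer_score_df (values are invented)
toy_answers = pd.DataFrame({"answers_id": ["a1", "a2", "a3"]})
toy_scores = pd.DataFrame({"answers_id": ["a1", "a2", "a9"], "score": [0, 3, 1]})

# how="outer" keeps keys from both sides; indicator=True adds a `_merge` column
# whose values are "both", "left_only", or "right_only"
audit = toy_answers.merge(toy_scores, on="answers_id", how="outer", indicator=True)
counts = audit["_merge"].value_counts()
print(counts)
```

Rows marked `left_only` are answers without a score; after a left merge they carry `NaN` in `score`, which is also why an integer score column is displayed as a float such as `0.0`.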


#### Renaming `score` to `answers_score` to avoid confusion in later merges

In [8]:
df_answer.rename(columns = {'score':'answers_score'}, inplace=True)

---
# Merging DataFrames with Question Information:

#### `q_score_df`

In [9]:
print("shape:", q_score_df.shape)
q_score_df.head(2)

shape: (23928, 2)


Unnamed: 0,id,score
0,38436aadef3d4b608ad089cf53ab0fe7,5
1,edb8c179c5d64c9cb812a59a32045f55,4


#### Renaming the `id` column in `q_score_df` to match the `questions_id` column in `ques_df` so we can merge the two DataFrames on `questions_id`

In [10]:
q_score_df.rename(columns={'id':'questions_id'}, inplace= True)

#### `ques_df`

In [11]:
print("shape:", ques_df.shape)
ques_df.head(2)

shape: (23931, 5)


Unnamed: 0,questions_id,questions_author_id,questions_date_added,questions_title,questions_body
0,332a511f1569444485cf7a7a556a5e54,8f6f374ffd834d258ab69d376dd998f5,2016-04-26 11:14:26 UTC+0000,Teacher career question,What is a maths teacher? what is a ma...
1,eb80205482e4424cad8f16bc25aa2d9c,acccbda28edd4362ab03fb8b6fd2d67b,2016-05-20 16:48:25 UTC+0000,I want to become an army officer. What can I d...,I am Priyanka from Bangalore . Now am in 10th ...


#### Creating a new DataFrame `df_questions` from the merged `ques_df` and `q_score_df`

In [12]:
df_questions = ques_df.merge(q_score_df, on = 'questions_id', how='left')
print("shape:", df_questions.shape)
df_questions.head(2)

shape: (23931, 6)


Unnamed: 0,questions_id,questions_author_id,questions_date_added,questions_title,questions_body,score
0,332a511f1569444485cf7a7a556a5e54,8f6f374ffd834d258ab69d376dd998f5,2016-04-26 11:14:26 UTC+0000,Teacher career question,What is a maths teacher? what is a ma...,1.0
1,eb80205482e4424cad8f16bc25aa2d9c,acccbda28edd4362ab03fb8b6fd2d67b,2016-05-20 16:48:25 UTC+0000,I want to become an army officer. What can I d...,I am Priyanka from Bangalore . Now am in 10th ...,5.0


---
---
---

## Merging the Tag Dataframes

### <span style = 'color:red'> if I decide that I'll only use 2/3 tag csv's make sure to note why

Merging the `tag_q_df` and the `tags_df`. We're merging on the `tag_id` with a left merge. We're doing this to keep all the information in the tag_q_df while matching up (and sometimes duplicating) information in the tags_df. 

We're not merging the `tag_users` DataFrame because 1) the information isn't crucial to later EDA and modeling and, more importantly, 2) we'd either be making up information, duplicating rows that shouldn't be duplicated, or losing significant amounts of information. Since it doesn't add much value, we're choosing not to merge it and to leave it as a standalone file.

#### <span style = 'color:red'> insert image if we can

#### Inspecting `tag_q_df`: its `tag_questions_question_id` column will later be renamed to match the `questions_id` column in `df_questions` so we can merge the two DataFrames on `questions_id`

In [13]:
tag_q_df.head(2)

Unnamed: 0,tag_questions_tag_id,tag_questions_question_id
0,28930,cb43ebee01364c68ac61d347a393ae39
1,28930,47f55e85ce944242a5a347ab85a8ffb4


In [14]:
tag_q_df.shape

(76553, 2)

<span style ='color:red'> So we have about 23K unique question IDs, but the DataFrame is about 76K rows long. If a question has multiple tags, each tag is stored in its own row.

In [15]:
tag_q_df.nunique()

tag_questions_tag_id          7091
tag_questions_question_id    23288
dtype: int64

In [16]:
tag_q_df['tag_questions_question_id'].sort_values()

12839    0003e7bf48f24b5c985f8fce96e611f3
59166    0003e7bf48f24b5c985f8fce96e611f3
48543    0003e7bf48f24b5c985f8fce96e611f3
29372    0003e7bf48f24b5c985f8fce96e611f3
23735    0003e7bf48f24b5c985f8fce96e611f3
                       ...               
27967    fffc471e892a4b4e826858426da79b7e
10242    fffde8d0b28247b8a3dd635ba792df04
44576    fffde8d0b28247b8a3dd635ba792df04
25971    fffde8d0b28247b8a3dd635ba792df04
5248     fffde8d0b28247b8a3dd635ba792df04
Name: tag_questions_question_id, Length: 76553, dtype: object

Exploring the data shows that if a question has more than one tag, its question ID is repeated once for each tag. If we left this as is, merging would either lose information or duplicate rows whose contents are roughly 90% identical. To save space without losing information, we're creating a column where each cell holds a list of all the tag IDs for that question.

## <span style = 'color:red'> in looking at this, we should probably merge the tag_names and the tag_id's, and when doing this append both the tag_names and the tag_ids

### If I groupby "tag_questions_question_id" it's not too hard to make a new column with the values from all the others
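A minimal sketch of that groupby idea, using invented toy data in place of `tag_q_df`; the `tag_ids` column name and the `toy_tag_q` and `tags_per_question` variables are hypothetical.

```python
import pandas as pd

# Toy stand-in for tag_q_df: one row per (tag, question) pair (values are invented)
toy_tag_q = pd.DataFrame({
    "tag_questions_tag_id": [27, 29, 27, 36, 29],
    "tag_questions_question_id": ["q1", "q1", "q2", "q2", "q3"],
})

# Collapse to one row per question, collecting that question's tag ids into a list
tags_per_question = (
    toy_tag_q.groupby("tag_questions_question_id")["tag_questions_tag_id"]
    .agg(list)
    .reset_index(name="tag_ids")
)
print(tags_per_question)
```

The result has one row per question ID, so merging it onto a questions DataFrame would not repeat question rows. The trade-off is that list-valued cells are harder to filter on and do not round-trip cleanly through a CSV export.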

In [17]:
cross_tab = pd.crosstab(tag_q_df.tag_questions_question_id, tag_q_df.tag_questions_tag_id, margins=True, margins_name="Total")

In [18]:
cross_tab.sort_values('Total', ascending=False)

tag_questions_tag_id,27,29,36,42,46,51,53,54,55,60,...,39110,39141,39142,39165,39182,39183,39248,39249,39250,Total
tag_questions_question_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Total,134,520,214,173,285,464,701,1083,505,21,...,1,1,1,1,1,4,1,2,1,76553
e79bf4570af646d5892cf42b031c2a52,0,0,0,1,0,1,1,1,0,0,...,0,0,0,0,0,0,0,0,0,54
2ea130631ba34b4181c5fd85816504cf,0,0,0,1,0,1,1,1,0,0,...,0,0,0,0,0,0,0,0,0,53
e1860d4512b746a19270e5675efb7b44,0,0,0,1,0,1,1,1,0,0,...,0,0,0,0,0,0,0,0,0,52
164522e7595649729deebf48cad87e1b,0,0,0,1,0,1,1,1,0,0,...,0,0,0,0,0,0,0,0,0,47
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
82fac2a6aedf4d8cbd45cb563fa6b9ad,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
a0e2d80dfa4b4843bac5c78c77da7eff,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
bc55024b0b0346e09965cb63b360f34f,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
8313e74cdfe44949895fe0708692755d,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1


In [19]:
# stop

## <span style ='color:red'> Finish

### Working on a way to get all the tag IDs into the same row instead of having duplicates:

In [20]:
# id_num = None #creating a variable that will be updated through iterations in the for loop

# counter = 0 
# # tag_q_df['tags'] = None

# for ids in tag_q_df['tag_questions_question_id'].sort_values():
# #     print("ids", ids)
# #     print(id_num)
#     counter += 1
#     print(counter)
    
#     if ids == id_num:
# #         print("yes")
#         tag_q_df['tags'] += tag_q_df['tag_questions_question_id'] # 
#         #add ids to tag_q_df, each row should be a column
    
# #         id_num = ids #update i to equal the last ids so it can compare 

# #since ids == id_num, we don't have to update the variable, it's already the same

#     else:
#         id_num = ids #update id_num to equal the last ids so it can compare to the next iteration

# #         continue



### <span style = 'color:red'> test

In [21]:
# small = tag_q_df.sort_values('tag_questions_question_id').head(25).drop(columns = 'tags')


In [23]:
# tag_q_df['tags'].unique()

In [24]:
tags_df.head(2)

Unnamed: 0,tags_tag_id,tags_tag_name
0,27490,college
1,461,computer-science


In [25]:
tags_df.shape

(16269, 2)

#### Renaming `tag_questions_tag_id` in `tag_q_df` and `tags_tag_id` in `tags_df` to `tag_id` so they match and we can merge on them. `tags_tag_name` is also shortened to `tag_name`.

In [26]:
tag_q_df.rename(columns ={'tag_questions_tag_id': 'tag_id'}, inplace = True)
tags_df.rename(columns={"tags_tag_id": "tag_id",  'tags_tag_name':'tag_name'}, inplace=True)

#### Merging the tag DataFrames on `tag_id`. We're doing a left merge because we want to keep every row in `tag_q_df` and don't mind if rows of `tags_df` are repeated or dropped.

In [27]:
df_tags = tag_q_df.merge(tags_df, on = "tag_id", how="left")

In [28]:
df_tags.shape

(76553, 3)

#### Exporting the tags dataframe so we can do specific EDA on the tags without messing up the other dataframes or merged data

In [29]:
df_tags.to_csv('./Datasets/merged_tag_df.csv',index=False)

#### Renaming the `tag_questions_question_id` column to `questions_id` so that it's compatible with the questions DataFrame

In [30]:
df_tags.rename(columns={'tag_questions_question_id': "questions_id"}, inplace = True)

---
---
---

### Merging Tags to Questions Dataframes

In [31]:
tag_ques_df = df_questions.merge(df_tags, on = 'questions_id', how='left')

In [32]:
tag_ques_df.head(2)

Unnamed: 0,questions_id,questions_author_id,questions_date_added,questions_title,questions_body,score,tag_id,tag_name
0,332a511f1569444485cf7a7a556a5e54,8f6f374ffd834d258ab69d376dd998f5,2016-04-26 11:14:26 UTC+0000,Teacher career question,What is a maths teacher? what is a ma...,1.0,14147.0,lecture
1,332a511f1569444485cf7a7a556a5e54,8f6f374ffd834d258ab69d376dd998f5,2016-04-26 11:14:26 UTC+0000,Teacher career question,What is a maths teacher? what is a ma...,1.0,27490.0,college


#### Renaming `score` to `questions_score` to avoid confusion when merging `tag_ques_df` and answers DataFrames

In [33]:
tag_ques_df.rename(columns = {'score':'questions_score'}, inplace=True)

In [34]:
tag_ques_df.shape

(77196, 8)

## Merging `tag_ques_df` with the `df_answer` DataFrame and saving as `merged_df`

#### Renaming `answers_question_id` to `questions_id` so we can merge the questions and answers DataFrames

In [35]:
df_answer.rename(columns = {'answers_question_id' : 'questions_id'}, inplace=True)

#### Merged questions and answers DataFrames

In [36]:
df_answer.shape

(51123, 6)

In [37]:
df_answer.head(2)

Unnamed: 0,answers_id,answers_author_id,questions_id,answers_date_added,answers_body,answers_score
0,4e5f01128cae4f6d8fd697cec5dca60c,36ff3b3666df400f956f8335cf53e09e,332a511f1569444485cf7a7a556a5e54,2016-04-29 19:40:14 UTC+0000,<p>Hi!</p>\n<p>You are asking a very interesti...,0.0
1,ada720538c014e9b8a6dceed09385ee3,2aa47af241bf42a4b874c453f0381bd4,eb80205482e4424cad8f16bc25aa2d9c,2018-05-01 14:19:08 UTC+0000,<p>Hi. I joined the Army after I attended coll...,0.0


In [38]:
tag_ques_df['questions_id'].nunique()

23931

In [39]:
tag_ques_df.shape

(77196, 8)

## <span style = 'color:red'> test: We're making it a left merge on the questions df

In [40]:
# merged_df = df_answer.merge(tag_ques_df, on='questions_id', how='left', indicator=True)
# merged_df.head(2)

In [41]:
merged_df = tag_ques_df.merge(df_answer, on='questions_id', how='left')
merged_df.head(2)

Unnamed: 0,questions_id,questions_author_id,questions_date_added,questions_title,questions_body,questions_score,tag_id,tag_name,answers_id,answers_author_id,answers_date_added,answers_body,answers_score
0,332a511f1569444485cf7a7a556a5e54,8f6f374ffd834d258ab69d376dd998f5,2016-04-26 11:14:26 UTC+0000,Teacher career question,What is a maths teacher? what is a ma...,1.0,14147.0,lecture,4e5f01128cae4f6d8fd697cec5dca60c,36ff3b3666df400f956f8335cf53e09e,2016-04-29 19:40:14 UTC+0000,<p>Hi!</p>\n<p>You are asking a very interesti...,0.0
1,332a511f1569444485cf7a7a556a5e54,8f6f374ffd834d258ab69d376dd998f5,2016-04-26 11:14:26 UTC+0000,Teacher career question,What is a maths teacher? what is a ma...,1.0,27490.0,college,4e5f01128cae4f6d8fd697cec5dca60c,36ff3b3666df400f956f8335cf53e09e,2016-04-29 19:40:14 UTC+0000,<p>Hi!</p>\n<p>You are asking a very interesti...,0.0


In [42]:
merged_df.shape

(180376, 13)

In [43]:
merged_df.isnull().sum()

questions_id               0
questions_author_id        0
questions_date_added       0
questions_title            0
questions_body             0
questions_score           26
tag_id                  1700
tag_name                1700
answers_id              2340
answers_author_id       2340
answers_date_added      2340
answers_body            2344
answers_score           2386
dtype: int64

In [44]:
merged_df.duplicated().sum()

0

## <span style = 'color:red'> This size isn't right; it should be about 55k. For some reason the merge is adding rows from both the questions and answers DataFrames.
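One possible reason for the unexpected row count: `tag_ques_df` holds one row per question-tag pair and `df_answer` one row per answer, so a merge on `questions_id` pairs every tag row with every answer row for the same question. The sketch below uses invented toy data (all `toy_*` names are hypothetical) to show how the row count follows from that.

```python
import pandas as pd

# One row per (question, tag) pair; q3 has a tag but no answers
toy_tag_ques = pd.DataFrame({
    "questions_id": ["q1", "q1", "q1", "q2", "q3"],
    "tag_id": [1, 2, 3, 1, 2],
})
# One row per answer
toy_answers = pd.DataFrame({
    "questions_id": ["q1", "q1", "q2"],
    "answers_id": ["a1", "a2", "a3"],
})

toy_merged = toy_tag_ques.merge(toy_answers, on="questions_id", how="left")

# Predicted size of the left merge: tags x answers per question,
# where a question with no answers still contributes its tag rows once
tags_per_q = toy_tag_ques.groupby("questions_id").size()
answers_per_q = toy_answers.groupby("questions_id").size()
answers_per_q = answers_per_q.reindex(tags_per_q.index).fillna(1)
expected_rows = int((tags_per_q * answers_per_q).sum())

print(len(toy_merged), expected_rows)
```

If the same calculation on the real DataFrames reproduces the observed row count, the size is a consequence of the many-to-many key rather than a fault in the merge call. Collapsing the tags to one row per question before merging would avoid the multiplication.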

#### Exporting as merged csv:

In [45]:
merged_df.to_csv("./Datasets/merged_df.csv", index=False)

In [46]:
merged_df.shape

(180376, 13)

---
---
---
After doing some EDA, I realized that a good DataFrame for EDA would combine `answers_df`, `ques_df`, and `tag_q_df`.

In [47]:
answers_df.head(2)
answers_df.rename(columns={'answers_question_id' :'questions_id'}, inplace=True)

In [48]:
ques_df.head(2)

Unnamed: 0,questions_id,questions_author_id,questions_date_added,questions_title,questions_body
0,332a511f1569444485cf7a7a556a5e54,8f6f374ffd834d258ab69d376dd998f5,2016-04-26 11:14:26 UTC+0000,Teacher career question,What is a maths teacher? what is a ma...
1,eb80205482e4424cad8f16bc25aa2d9c,acccbda28edd4362ab03fb8b6fd2d67b,2016-05-20 16:48:25 UTC+0000,I want to become an army officer. What can I d...,I am Priyanka from Bangalore . Now am in 10th ...


In [49]:
qa_merge = ques_df.merge(answers_df, on = 'questions_id', how = 'left')

In [50]:
qa_merge.head(2)

Unnamed: 0,questions_id,questions_author_id,questions_date_added,questions_title,questions_body,answers_id,answers_author_id,answers_date_added,answers_body
0,332a511f1569444485cf7a7a556a5e54,8f6f374ffd834d258ab69d376dd998f5,2016-04-26 11:14:26 UTC+0000,Teacher career question,What is a maths teacher? what is a ma...,4e5f01128cae4f6d8fd697cec5dca60c,36ff3b3666df400f956f8335cf53e09e,2016-04-29 19:40:14 UTC+0000,<p>Hi!</p>\n<p>You are asking a very interesti...
1,eb80205482e4424cad8f16bc25aa2d9c,acccbda28edd4362ab03fb8b6fd2d67b,2016-05-20 16:48:25 UTC+0000,I want to become an army officer. What can I d...,I am Priyanka from Bangalore . Now am in 10th ...,ada720538c014e9b8a6dceed09385ee3,2aa47af241bf42a4b874c453f0381bd4,2018-05-01 14:19:08 UTC+0000,<p>Hi. I joined the Army after I attended coll...


In [51]:
qa_merge.isnull().sum()

questions_id              0
questions_author_id       0
questions_date_added      0
questions_title           0
questions_body            0
answers_id              821
answers_author_id       821
answers_date_added      821
answers_body            822
dtype: int64

In [52]:
qa_merge['was_answered'] = qa_merge['answers_id'].notnull().astype(int)

In [53]:
qa_merge = qa_merge.drop(columns=['answers_id', 'answers_author_id','answers_date_added', 'answers_body'])

In [54]:
qa_merge.head()

Unnamed: 0,questions_id,questions_author_id,questions_date_added,questions_title,questions_body,was_answered
0,332a511f1569444485cf7a7a556a5e54,8f6f374ffd834d258ab69d376dd998f5,2016-04-26 11:14:26 UTC+0000,Teacher career question,What is a maths teacher? what is a ma...,1
1,eb80205482e4424cad8f16bc25aa2d9c,acccbda28edd4362ab03fb8b6fd2d67b,2016-05-20 16:48:25 UTC+0000,I want to become an army officer. What can I d...,I am Priyanka from Bangalore . Now am in 10th ...,1
2,eb80205482e4424cad8f16bc25aa2d9c,acccbda28edd4362ab03fb8b6fd2d67b,2016-05-20 16:48:25 UTC+0000,I want to become an army officer. What can I d...,I am Priyanka from Bangalore . Now am in 10th ...,1
3,4ec31632938a40b98909416bdd0decff,f2c179a563024ccc927399ce529094b5,2017-02-08 19:13:38 UTC+0000,Will going abroad for your first job increase ...,I'm planning on going abroad for my first job....,1
4,2f6a9a99d9b24e5baa50d40d0ba50a75,2c30ffba444e40eabb4583b55233a5a4,2017-09-01 14:05:32 UTC+0000,To become a specialist in business management...,i hear business management is a hard way to ge...,1
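In the output above, the question starting `eb802054...` appears twice because the left merge produced one row per answer and those rows became identical once the answer columns were dropped. If one row per question is the goal, a sketch along these lines might help; it uses invented toy data, and `answer_count`, `per_question`, and the `toy_*` frames are hypothetical names.

```python
import pandas as pd

# Toy stand-ins for ques_df and answers_df (values are invented)
toy_questions = pd.DataFrame({"questions_id": ["q1", "q2", "q3"]})
toy_answers = pd.DataFrame({
    "questions_id": ["q1", "q2", "q2"],
    "answers_id": ["a1", "a2", "a3"],
})

toy_qa = toy_questions.merge(toy_answers, on="questions_id", how="left")
toy_qa["was_answered"] = toy_qa["answers_id"].notnull().astype(int)

# Keep the number of answers per question before the answer columns are dropped;
# "count" ignores NaN, so unanswered questions get 0
toy_qa["answer_count"] = toy_qa.groupby("questions_id")["answers_id"].transform("count")

# Drop the per-answer column, then collapse the now-identical rows
per_question = (
    toy_qa.drop(columns="answers_id")
    .drop_duplicates()
    .reset_index(drop=True)
)
print(per_question)
```

Keeping a count preserves how many answers each question received, which a 0/1 flag alone discards.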


In [55]:
tag_q_df.head()

Unnamed: 0,tag_id,tag_questions_question_id
0,28930,cb43ebee01364c68ac61d347a393ae39
1,28930,47f55e85ce944242a5a347ab85a8ffb4
2,28930,ccc30a033a0f4dfdb2eb987012f25792
3,28930,e30b274e48d741f7bf50eb5e7171a3c0
4,28930,3d22742052df4989b311b4195cbb0f1a


In [56]:
tag_q_df['has_tag'] = tag_q_df['tag_questions_question_id'].notnull().astype(int)

In [57]:
has_tag_df = tag_q_df.drop(columns = 'tag_id')
has_tag_df = has_tag_df.drop_duplicates()

In [58]:
print("shape:", has_tag_df.shape)
has_tag_df.head(2)

shape: (23288, 2)


Unnamed: 0,tag_questions_question_id,has_tag
0,cb43ebee01364c68ac61d347a393ae39,1
1,47f55e85ce944242a5a347ab85a8ffb4,1


In [60]:
has_tag_df.rename(columns = {'tag_questions_question_id': 'questions_id'}, inplace=True)

In [61]:
merged_qa_tag_df = qa_merge.merge(has_tag_df, on="questions_id", how='left')

In [62]:
merged_qa_tag_df['has_tag'].nunique()

1
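`nunique()` returns 1 here because it skips `NaN` by default: questions missing from `has_tag_df` receive `NaN` rather than 0 after the left merge, so the only counted value is 1. Assuming the intent is a 0/1 flag, the missing values can be filled after merging, as sketched below on invented toy data (the `toy_*` names are hypothetical).

```python
import pandas as pd

# Toy stand-ins for qa_merge and has_tag_df (values are invented); q2 has no tag
toy_qa = pd.DataFrame({"questions_id": ["q1", "q2", "q3"], "was_answered": [1, 1, 0]})
toy_has_tag = pd.DataFrame({"questions_id": ["q1", "q3"], "has_tag": [1, 1]})

toy_out = toy_qa.merge(toy_has_tag, on="questions_id", how="left")
print(toy_out["has_tag"].nunique())  # NaN is not counted by default

# Untagged questions have NaN after the left merge; turn them into 0
toy_out["has_tag"] = toy_out["has_tag"].fillna(0).astype(int)
print(toy_out["has_tag"].value_counts())
```

`nunique(dropna=False)` is another way to check whether missing values are hiding behind a count of 1.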

#### Exporting `merged_qa_tag_df` to csv for EDA

In [None]:
merged_qa_tag_df.to_csv('./Datasets/merged_qa_tag_df.csv', index=False)