# Tidy Challenge

## Serena Gestring - 2/2/25

To begin, I will import the pandas module and import the json file I will be working with throughout this challenge.

In [1]:
import pandas as pd
import json

In [2]:
with open('chessbuds_messages.json') as c:
    chess = json.load(c)

In [3]:
type(chess)

dict

### Creating A DataFrame

Before creating a DataFrame, I want to better understand the structure of the dictionary, so I checked the keys. 

In [4]:
chess.keys()

dict_keys(['participants', 'messages', 'title', 'is_still_participant', 'thread_type', 'thread_path', 'magic_words', 'joinable_mode'])

From here I want to investigate the keys to see what data is stored in each so I can figure out what the best observational unit for my final tidy DataFrame should be.

In [5]:
chess['participants']

[{'name': 'Scott Pence'},
 {'name': 'Chad Larson'},
 {'name': 'Joanna Rusch'},
 {'name': 'Angela Babbitt Pence'},
 {'name': 'David Silva'},
 {'name': 'Aaron Rusch'},
 {'name': 'Timothy Vanderpool'}]

I ran the cell below, but its output was so long that I cleared it so it would not appear once I upload this Jupyter notebook to GitHub.

In [None]:
chess['messages']

In [7]:
chess['title']

'Chess Buds'

In [8]:
chess['is_still_participant']

True

In [9]:
chess['thread_type']

'RegularGroup'

In [10]:
chess['thread_path']

'inbox/chessbuds_npjakt9u1g'

In [11]:
chess['magic_words']

[]

In [12]:
chess['joinable_mode']

{'mode': 1, 'link': ''}
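The key-by-key inspection above could also be condensed into one loop. This is a minimal sketch on a small made-up dictionary (`chess_sample` and `summarize` are hypothetical names, not part of the notebook) that reports the type and, where it applies, the length of each value.

```python
# Hypothetical stand-in for the loaded export; the real `chess` dict comes from json.load.
chess_sample = {
    'participants': [{'name': 'Scott Pence'}, {'name': 'Chad Larson'}],
    'messages': [{'sender_name': 'Chad Larson', 'content': 'hi'}],
    'title': 'Chess Buds',
    'is_still_participant': True,
    'magic_words': [],
}


def summarize(d):
    """Return {key: (type name, length or None)} for every key in a dictionary."""
    summary = {}
    for key, value in d.items():
        # Booleans and numbers have no length, so record None for them.
        size = len(value) if hasattr(value, '__len__') else None
        summary[key] = (type(value).__name__, size)
    return summary


for key, info in summarize(chess_sample).items():
    print(key, info)
```

A long list such as 'messages' then shows up as a type and a count instead of a very long printout.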

Even though I cleared the code cell, the messages key seems to be the only one with any interesting data, so I am going to create and tidy a DataFrame for the messages (my observational unit). Below are the first five rows of my initial messy DataFrame, which show which columns are already tidy and which need to be tidied.

In [13]:
messages_df = pd.DataFrame(chess['messages'])
messages_df.head()

Unnamed: 0,sender_name,timestamp_ms,content,reactions,type,is_unsent,is_taken_down,bumped_message_metadata,share,photos,gifs,users
0,Joanna Rusch,1666374933946,Maybe he just wants to ride the publicity for ...,"[{'reaction': 'ð', 'actor': 'Chad Larson'},...",Generic,False,False,{'bumped_message': 'Maybe he just wants to rid...,,,,
1,Chad Larson,1666373448613,To be fair to Hans....no one wants to be assoc...,"[{'reaction': 'ð', 'actor': 'Scott Pence'},...",Generic,False,False,{'bumped_message': 'To be fair to Hans....no o...,,,,
2,Chad Larson,1666373216381,He would have to prove he didn't cheat and tha...,"[{'reaction': 'ð', 'actor': 'Scott Pence'},...",Generic,False,False,{'bumped_message': 'He would have to prove he ...,,,,
3,Scott Pence,1666373164883,"Yeah, no way. You over shoot and hope to get a...","[{'reaction': 'ð', 'actor': 'Chad Larson'},...",Generic,False,False,"{'bumped_message': 'Yeah, no way. You over sho...",,,,
4,Chad Larson,1666373111157,"From what I see, I don't think he could win. ...",,Generic,False,False,"{'bumped_message': 'From what I see, I don't t...",,,,


Already, it appears that each row contains one observation (an individual message), so that does not need to be tidied. The 'sender_name' attribute (column) seems to contain only one name/value per observation, which makes sense because one message isn't going to be sent from multiple senders, so that is tidy. This also seems to be the case for the 'timestamp_ms,' 'content,' 'type,' 'is_unsent,' and 'is_taken_down' attributes. You can also tell because none of these cells contain a dictionary ({}) or a list ([]). The presence of either indicates there are multiple values/attributes stored in one cell, which does not meet tidy principles. This *is* the case for the 'reactions' and 'bumped_message_metadata' attributes. I am unsure about the 'share,' 'photos,' 'gifs,' and 'users' attributes because they are all empty for the first 5 observations, so I will have to investigate those attributes further to see if they need to be tidied.

### Manipulating the DataFrame

Before I begin to manipulate and tidy the data frame, I need to learn more about some of these attributes and how they are structured. 

Below I am investigating 'reactions' further. Some row values are empty, but those that are not contain a list which then contains a dictionary (or multiple) that defines a 'reaction' and an 'actor.' The 'reaction' data seems to be stored as symbols, which really don't mean anything to me so I am deciding that for the 'reactions' attribute I am going to pull just the 'actor,' or the people who reacted to the message, from this attribute and add it as its own column to the full DataFrame.

In [14]:
messages_df['reactions']

0      [{'reaction': 'ð', 'actor': 'Chad Larson'},...
1      [{'reaction': 'ð', 'actor': 'Scott Pence'},...
2      [{'reaction': 'ð', 'actor': 'Scott Pence'},...
3      [{'reaction': 'ð', 'actor': 'Chad Larson'},...
4                                                    NaN
                             ...                        
218                                                  NaN
219       [{'reaction': 'ð', 'actor': 'Chad Larson'}]
220                                                  NaN
221                                                  NaN
222                                                  NaN
Name: reactions, Length: 223, dtype: object

In [15]:
messages_df['reactions'][0]

[{'reaction': 'ð\x9f\x91\x8d', 'actor': 'Chad Larson'},
 {'reaction': 'ð\x9f\x91\x8d', 'actor': 'Chad Larson'}]

In [16]:
messages_df['reactions'][0][0]['actor']

'Chad Larson'
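The reaction symbols shown above look like mojibake, meaning UTF-8 emoji bytes that were read as Latin-1 characters. That is an assumption about how this export was encoded, not something the file itself confirms. Under that assumption, a minimal sketch of a repair helper (`fix_mojibake` is a hypothetical name) could look like this:

```python
def fix_mojibake(text):
    """Re-read Latin-1-decoded text as UTF-8; return the input unchanged if that fails."""
    try:
        return text.encode('latin-1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


garbled = '\xf0\x9f\x91\x8d'  # the same four characters shown in the output of cell 15
print(fix_mojibake(garbled))  # a thumbs-up emoji, if the encoding assumption holds
```

If the assumption is right, the 'reaction' values become readable emoji and could be kept as data rather than discarded.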

Below is my attempt at trying to use list comprehension to pull the 'actor' data from the 'reactions' dictionaries, but I got this error:

In [17]:
[[y['actor'] for y in x] for x in messages_df['reactions']]

TypeError: 'float' object is not iterable

So I investigated further using type. I did run the below cell but I cleared its contents since it was really long. But what I found was two types of structures: list and float. Google told me the float is a data type that represents decimal numbers. This confused me and I did not know what to do.

In [None]:
[type(x) for x in messages_df['reactions']]
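The float entries are most likely the NaN placeholders that pandas uses for missing values: NaN is itself a float, and a float cannot be looped over. A minimal sketch, using a small made-up Series in place of the real 'reactions' column, shows the type and one possible guard with `isinstance`:

```python
import pandas as pd

# Made-up sample that mirrors the structure of the 'reactions' column
reactions = pd.Series([
    [{'reaction': 'x', 'actor': 'Chad Larson'}, {'reaction': 'x', 'actor': 'Scott Pence'}],
    float('nan'),  # a message with no reactions
    [{'reaction': 'x', 'actor': 'Chad Larson'}],
], dtype=object)

print(type(reactions[1]))  # the missing value is a float

# Loop over a cell only when it is a list; fall back to an empty list otherwise
actors = [
    [r['actor'] for r in cell] if isinstance(cell, list) else []
    for cell in reactions
]
print(actors)
```

The same guard should carry over to any other column that mixes lists with NaN.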

Then I thought that since some messages have multiple 'actors' reacting to it, it wouldn't be tidy to have actors as its own column anyway since some would have multiple values in one cell. So then I thought maybe I should try just using 'len' like we did in class to get a numeric value for how many 'actors' reacted to each message because that would give us one value per message. However, I got a similar error message that float is not iterable. I tried looking at the documentation to figure out how to deal with floats, but I kept seeing 'floating-point' and 'PyFloat,' and nothing was helpful. 

In [19]:
[[len(y['actor']) for y in x] for x in messages_df['reactions']]

TypeError: 'float' object is not iterable

Without knowing what else to try, I have decided just to drop the 'reactions' attribute altogether and move on to the rest of the DataFrame.

In [20]:
messages_df.drop('reactions', axis=1, inplace=True)

In [21]:
messages_df.head()

Unnamed: 0,sender_name,timestamp_ms,content,type,is_unsent,is_taken_down,bumped_message_metadata,share,photos,gifs,users
0,Joanna Rusch,1666374933946,Maybe he just wants to ride the publicity for ...,Generic,False,False,{'bumped_message': 'Maybe he just wants to rid...,,,,
1,Chad Larson,1666373448613,To be fair to Hans....no one wants to be assoc...,Generic,False,False,{'bumped_message': 'To be fair to Hans....no o...,,,,
2,Chad Larson,1666373216381,He would have to prove he didn't cheat and tha...,Generic,False,False,{'bumped_message': 'He would have to prove he ...,,,,
3,Scott Pence,1666373164883,"Yeah, no way. You over shoot and hope to get a...",Generic,False,False,"{'bumped_message': 'Yeah, no way. You over sho...",,,,
4,Chad Larson,1666373111157,"From what I see, I don't think he could win. ...",Generic,False,False,"{'bumped_message': 'From what I see, I don't t...",,,,
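As a possible alternative to dropping 'reactions', the count idea could be sketched as follows. Note that `len(y['actor'])` measures the length of a name string, while `len` applied to the whole cell counts the reactions. The DataFrame `sample` is made up for illustration and treats a missing cell as zero reactions, which is an assumption.

```python
import pandas as pd

# Made-up sample; the real data would be messages_df before 'reactions' was dropped
sample = pd.DataFrame({
    'content': ['first', 'second', 'third'],
    'reactions': [
        [{'reaction': 'x', 'actor': 'Chad Larson'}, {'reaction': 'x', 'actor': 'Scott Pence'}],
        None,
        [{'reaction': 'x', 'actor': 'Chad Larson'}],
    ],
})

# One number per message: length of the list, or 0 when the cell is missing
sample['reaction_count'] = sample['reactions'].apply(
    lambda cell: len(cell) if isinstance(cell, list) else 0
)
print(sample[['content', 'reaction_count']])
```

This keeps one value per cell, so the new column would satisfy the tidy rule that the original list column broke.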


Next, I want to investigate the 'bumped_message_metadata' attribute. This attribute contains a dictionary that defines two things: the bumped message (which seems to be the same as the 'content' attribute) and an 'is_bumped' attribute that can be either True or False.

In [22]:
messages_df['bumped_message_metadata'][0]

{'bumped_message': "Maybe he just wants to ride the publicity for a bit longer, even if he doesn't get any money from the lawsuit. Like, I didn't know his name before this but I certainly do now.",
 'is_bumped': False}

Since I felt it would be redundant to include 'bumped_message' when it seems to be the same as the 'content' attribute, I thought I would try a similar approach as before by just pulling out the 'is_bumped' attribute for each message to make it its own column with one value: True or False. But I got this error:

In [23]:
[[y['is_bumped'] for y in x] for x in messages_df['bumped_message_metadata']]

TypeError: string indices must be integers, not 'str'

I looked back at the example we did in class, and I think the issue is that when we were accessing the 'name' attribute within 'artists' the data was a list, whereas here the 'is_bumped' values are not lists but booleans (True or False - I think boolean is the correct language to use here). That is the only difference I can see, so I think that is the problem, but I do not know how to fix it. I found [this article](https://builtin.com/data-science/string-indices-must-be-integers) about the "string indices must be integers" error, but I didn't understand the solutions it was describing. And so 'bumped_message_metadata' is getting dropped too.

In [26]:
messages_df.drop('bumped_message_metadata', axis=1, inplace=True)

In [27]:
messages_df.head()

Unnamed: 0,sender_name,timestamp_ms,content,type,is_unsent,is_taken_down,share,photos,gifs,users
0,Joanna Rusch,1666374933946,Maybe he just wants to ride the publicity for ...,Generic,False,False,,,,
1,Chad Larson,1666373448613,To be fair to Hans....no one wants to be assoc...,Generic,False,False,,,,
2,Chad Larson,1666373216381,He would have to prove he didn't cheat and tha...,Generic,False,False,,,,
3,Scott Pence,1666373164883,"Yeah, no way. You over shoot and hope to get a...",Generic,False,False,,,,
4,Chad Larson,1666373111157,"From what I see, I don't think he could win. ...",Generic,False,False,,,,
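A likely explanation for the "string indices" error is that each 'bumped_message_metadata' cell is a single dictionary rather than a list of dictionaries. Looping over a dictionary yields its keys, which are strings, so the inner `y['is_bumped']` was indexing a string. A minimal sketch on made-up data, with only one level of looping:

```python
import pandas as pd

# Made-up sample mirroring the structure of 'bumped_message_metadata'
meta = pd.Series([
    {'bumped_message': 'hello', 'is_bumped': False},
    {'bumped_message': 'again', 'is_bumped': True},
], dtype=object)

# Looping over one cell yields its keys, not dictionaries
print([key for key in meta[0]])

# Index each dictionary directly; guard against cells that are not dictionaries
is_bumped = [cell['is_bumped'] if isinstance(cell, dict) else None for cell in meta]
print(is_bumped)
```

The resulting list has one True/False value per message and could be assigned to a new column.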


I will now investigate the 'share' attribute to see if it can contain more than one value. As it turns out, if the value is not empty, it can contain a dictionary, and each dictionary only seems to contain one url. If it could contain more than one url, the dictionary would be contained within a list (encased by square brackets) so that multiple dictionaries could be included. Since that is not the case, I believe this technically counts as only being one value per cell, so I believe 'share' meets tidy data principles. 

In [31]:
messages_df['share'].iloc[0:10]

0                                                  NaN
1                                                  NaN
2                                                  NaN
3                                                  NaN
4                                                  NaN
5    {'link': 'https://www.youtube.com/watch?v=EDvK...
6                                                  NaN
7                                                  NaN
8                                                  NaN
9                                                  NaN
Name: share, dtype: object

In [32]:
messages_df['share'][5]

{'link': 'https://www.youtube.com/watch?v=EDvK6i86EZ0'}

In [33]:
messages_df['share'].iloc[11:20]

11                                                  NaN
12    {'link': 'https://new.chess24.com/wall/news/ne...
13                   {'link': 'https://www.chess.com/'}
14                                                  NaN
15                                                  NaN
16                   {'link': 'https://www.chess.com/'}
17                                                  NaN
18                                                  NaN
19    {'link': 'https://www.youtube.com/watch?v=YktW...
Name: share, dtype: object
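If a plain string per cell is preferred over a one-key dictionary, the URL could be pulled into its own column. This is a sketch on a made-up Series (`share_link` is a hypothetical column name), assuming every non-empty cell is a dictionary with a 'link' key as in the rows shown above.

```python
import pandas as pd

# Made-up sample mirroring the structure of the 'share' column
share = pd.Series([
    float('nan'),
    {'link': 'https://www.chess.com/'},
    float('nan'),
], dtype=object)

# Keep the URL when the cell is a dictionary, otherwise leave the value missing
share_link = share.apply(lambda cell: cell.get('link') if isinstance(cell, dict) else None)
print(share_link)
```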

Next I will investigate the 'photos' attribute to see if it can contain more than one value per cell. I found that for values that are not empty, they contain a list that includes a dictionary with a 'uri' key and a 'creation_timestamp' key. 

In [34]:
messages_df['photos'].iloc[0:10]

0                                                  NaN
1                                                  NaN
2                                                  NaN
3                                                  NaN
4                                                  NaN
5                                                  NaN
6                                                  NaN
7                                                  NaN
8    [{'uri': 'messages/inbox/chessbuds_npjakt9u1g/...
9    [{'uri': 'messages/inbox/chessbuds_npjakt9u1g/...
Name: photos, dtype: object

In [35]:
messages_df['photos'][8]

[{'uri': 'messages/inbox/chessbuds_npjakt9u1g/photos/312097852_642845703998797_5890449692529901285_n_642845700665464.png',
  'creation_timestamp': 1666197228}]

I think it would make sense to pull the 'uri' and 'creation_timestamp' keys and make each their own column because then they would each contain only one value and meet tidy standards. However, I have run into the same issue where many of the 'photos' rows have empty values (NaN), so I think I will run into the same issue I had before with floats. And I did.

In [38]:
[[y['uri'] for y in x] for x in messages_df['photos']]

TypeError: 'float' object is not iterable

In [39]:
[[y['creation_timestamp'] for y in x] for x in messages_df['photos']]

TypeError: 'float' object is not iterable

At this point I am wondering if there is something I can do about the NaN values throughout my DataFrame. I found DataFrame.fillna using the pandas documentation and wondered if that would help me. I used an example in [this page as a reference](https://en.wikipedia.org).

In [40]:
values = {'photos': "{'uri': 'N/A', 'creation_timestamp': 'N/A'}"}
messages_df.fillna(value=values).head()

Unnamed: 0,sender_name,timestamp_ms,content,type,is_unsent,is_taken_down,share,photos,gifs,users
0,Joanna Rusch,1666374933946,Maybe he just wants to ride the publicity for ...,Generic,False,False,,"{'uri': 'N/A', 'creation_timestamp': 'N/A'}",,
1,Chad Larson,1666373448613,To be fair to Hans....no one wants to be assoc...,Generic,False,False,,"{'uri': 'N/A', 'creation_timestamp': 'N/A'}",,
2,Chad Larson,1666373216381,He would have to prove he didn't cheat and tha...,Generic,False,False,,"{'uri': 'N/A', 'creation_timestamp': 'N/A'}",,
3,Scott Pence,1666373164883,"Yeah, no way. You over shoot and hope to get a...",Generic,False,False,,"{'uri': 'N/A', 'creation_timestamp': 'N/A'}",,
4,Chad Larson,1666373111157,"From what I see, I don't think he could win. ...",Generic,False,False,,"{'uri': 'N/A', 'creation_timestamp': 'N/A'}",,


Based on the new DataFrame above, replacing the NaN values seems to have worked, but I do not know if doing so will now allow me to do what I want to do. Below I try again to pull the 'uri' and 'creation_timestamp' keys from 'photos.'

In [45]:
[[y['uri'] for y in x] for x in messages_df['photos']]

TypeError: 'float' object is not iterable

In [46]:
[[y['creation_timestamp'] for y in x] for x in messages_df['photos']]

TypeError: 'float' object is not iterable

It still did not work. Looking back it's probably because I passed a string to the NaN values in 'photos' so it does make sense that the 'uri' and 'creation_timestamp' within the string were not recognized. I tried again without passing it into a string.

In [49]:
values = {'photos': {'uri': 'N/A', 'creation_timestamp': 'N/A'}}
messages_df.fillna(value=values).head()

Unnamed: 0,sender_name,timestamp_ms,content,type,is_unsent,is_taken_down,share,photos,gifs,users
0,Joanna Rusch,1666374933946,Maybe he just wants to ride the publicity for ...,Generic,False,False,,,,
1,Chad Larson,1666373448613,To be fair to Hans....no one wants to be assoc...,Generic,False,False,,,,
2,Chad Larson,1666373216381,He would have to prove he didn't cheat and tha...,Generic,False,False,,,,
3,Scott Pence,1666373164883,"Yeah, no way. You over shoot and hope to get a...",Generic,False,False,,,,
4,Chad Larson,1666373111157,"From what I see, I don't think he could win. ...",Generic,False,False,,,,


However, it does not replace the NaN values if it is not in a string. It also might not have anything to do with the NaN values at all. I cannot think of how else I can solve this problem other than dropping 'photos' altogether like I have been with the other attributes I cannot figure out.

In [50]:
messages_df.drop('photos', axis=1, inplace=True)

In [51]:
messages_df.head()

Unnamed: 0,sender_name,timestamp_ms,content,type,is_unsent,is_taken_down,share,gifs,users
0,Joanna Rusch,1666374933946,Maybe he just wants to ride the publicity for ...,Generic,False,False,,,
1,Chad Larson,1666373448613,To be fair to Hans....no one wants to be assoc...,Generic,False,False,,,
2,Chad Larson,1666373216381,He would have to prove he didn't cheat and tha...,Generic,False,False,,,
3,Scott Pence,1666373164883,"Yeah, no way. You over shoot and hope to get a...",Generic,False,False,,,
4,Chad Larson,1666373111157,"From what I see, I don't think he could win. ...",Generic,False,False,,,
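A note on the `fillna` attempts above: `fillna` returns a new DataFrame and leaves the original untouched unless the result is assigned back, so `messages_df['photos']` still held NaN when the list comprehension was rerun. Filling with a string would not have helped either, since looping over a string yields single characters. The dictionary fill value may have had no effect because pandas reads a dictionary as a mapping from row labels to values, though that reading is offered with some caution. A minimal sketch on made-up data (the uri is hypothetical) that shows the copy behaviour and an `isinstance` guard instead of filling:

```python
import pandas as pd

# Made-up sample; 'photos/example.png' is a hypothetical path
df = pd.DataFrame({
    'content': ['a', 'b'],
    'photos': [None, [{'uri': 'photos/example.png', 'creation_timestamp': 1666197228}]],
})

filled = df.fillna({'photos': 'no photo'})   # returns a new DataFrame
print(df['photos'].isna().tolist())          # the original still has the missing value
print(filled['photos'].isna().tolist())      # only the copy was filled

# Guard against missing cells rather than filling them
df['photo_uris'] = df['photos'].apply(
    lambda cell: [p['uri'] for p in cell] if isinstance(cell, list) else []
)
df['photo_count'] = df['photo_uris'].apply(len)
print(df[['content', 'photo_count']])
```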


Next I investigated both 'gifs' and 'users' to see if those can contain multiple values. 

In [52]:
messages_df['gifs']

0                                                    NaN
1                                                    NaN
2                                                    NaN
3                                                    NaN
4                                                    NaN
                             ...                        
218                                                  NaN
219    [{'uri': 'messages/inbox/chessbuds_npjakt9u1g/...
220                                                  NaN
221                                                  NaN
222                                                  NaN
Name: gifs, Length: 223, dtype: object

In [53]:
messages_df['gifs'][219]

[{'uri': 'messages/inbox/chessbuds_npjakt9u1g/gifs/50203035_295326594670631_5010610380540477440_n_729120860993392.gif'}]

In [58]:
messages_df['gifs'].iloc[41:50]

41                                                  NaN
42                                                  NaN
43    [{'uri': 'messages/inbox/chessbuds_npjakt9u1g/...
44                                                  NaN
45                                                  NaN
46                                                  NaN
47                                                  NaN
48                                                  NaN
49                                                  NaN
Name: gifs, dtype: object

In [59]:
messages_df['gifs'][43]

[{'uri': 'messages/inbox/chessbuds_npjakt9u1g/gifs/271509378_440207271109794_8171423686120017391_n_1734383773606976.gif'}]

In [60]:
messages_df['users']

0                                     NaN
1                                     NaN
2                                     NaN
3                                     NaN
4                                     NaN
                      ...                
218                                   NaN
219                                   NaN
220    [{'name': 'Angela Babbitt Pence'}]
221                                   NaN
222                                   NaN
Name: users, Length: 223, dtype: object

In [75]:
messages_df['users'].iloc[141:150]

141    [{'name': 'David Silva'}]
142                          NaN
143                          NaN
144                          NaN
145                          NaN
146    [{'name': 'David Silva'}]
147                          NaN
148                          NaN
149                          NaN
Name: users, dtype: object

So it seems both 'gifs' and 'users' contain lists, and within each list is a dictionary with only one element. However, because each dictionary is in a list I think that means both 'gifs' and 'users' have the ability to contain more than one dictionary (such as a message with more than one gif included) so I do not think these attributes count as tidy. However, I run into the same issue as before when trying to pull the keys out:

In [76]:
[[y['uri'] for y in x] for x in messages_df['gifs']]

TypeError: 'float' object is not iterable

In [77]:
[[y['name'] for y in x] for x in messages_df['users']]

TypeError: 'float' object is not iterable

I am not sure what else to do other than drop these two attributes.

In [78]:
messages_df.drop('gifs', axis=1, inplace=True)

In [79]:
messages_df.drop('users', axis=1, inplace=True)
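Because a list cell can hold several items, one tidy option other than dropping the column is a separate long table with one row per message and user pair. The sketch below uses made-up data, including an invented message that lists two users, and uses 'timestamp_ms' as a stand-in message identifier, which is an assumption. `DataFrame.explode` turns each list element into its own row.

```python
import pandas as pd

# Made-up sample; the third message listing two users is invented for illustration
msgs = pd.DataFrame({
    'timestamp_ms': [1, 2, 3],
    'users': [
        None,
        [{'name': 'David Silva'}],
        [{'name': 'David Silva'}, {'name': 'Angela Babbitt Pence'}],
    ],
})

users_long = (
    msgs[['timestamp_ms', 'users']]
    .dropna(subset=['users'])   # keep only messages that list users
    .explode('users')           # one row per (message, user) pair
    .reset_index(drop=True)
)
# Each remaining cell is a single dictionary, so the name can be pulled out directly
users_long['user_name'] = users_long['users'].apply(lambda d: d['name'])
users_long = users_long.drop(columns='users')
print(users_long)
```

The same pattern should apply to 'gifs' with the 'uri' key, leaving the main messages table with one row per message.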

### Final Tidy DataFrame

After investigating the structure of each piece of data in my DataFrame and trying to figure out how to tidy it, below are the first 5 rows of my final tidy DataFrame.

In [80]:
messages_df.head()

Unnamed: 0,sender_name,timestamp_ms,content,type,is_unsent,is_taken_down,share
0,Joanna Rusch,1666374933946,Maybe he just wants to ride the publicity for ...,Generic,False,False,
1,Chad Larson,1666373448613,To be fair to Hans....no one wants to be assoc...,Generic,False,False,
2,Chad Larson,1666373216381,He would have to prove he didn't cheat and tha...,Generic,False,False,
3,Scott Pence,1666373164883,"Yeah, no way. You over shoot and hope to get a...",Generic,False,False,
4,Chad Larson,1666373111157,"From what I see, I don't think he could win. ...",Generic,False,False,


In my new DataFrame, each row (observation) represents an individual message and each column represents a singular variable/attribute of each message.

My final DataFrame meets tidy data principles because each column is a variable, each row is an observation, and only one observational unit (messages) makes up the entire DataFrame. However, I got to this DataFrame by dropping several variables, which I realize is not the best solution and is especially problematic if that is data I need for any hypothetical visualizations, because then the story would be incomplete. I am sure there is an alternative tidy DataFrame I could make where each attribute/variable is its own column without having to drop any data. I do believe I was on the right track with trying to pull out certain pieces of data ('uri,' 'creation_timestamp,' 'name,' etc.) from within their dictionaries using list comprehension and turning them into their own columns so their values are by themselves as per tidy standards; but as of now I do not know how to correctly execute those tasks while the data exists in its current structure.

### Potential Visualization

In regard to potential visualizations, I think choosing an appropriate visualization would depend on what aspects of this data you want to communicate. I think it would be very difficult and probably confusing to try to visually represent every single variable in the DataFrame in one visualization, so I think it would be better to determine the most important insights from the data and choose an appropriate visualization (or visualizations) to represent those insights.

The simplest type of visualization that I can think of right now would be a regular bar chart where the x-axis represents each participant's name (from the 'sender_name' variable) and the y-axis is number of messages, so the height of each person's bar represents how many messages in the DataFrame were sent by them. This would tell you who is more or less active compared to the other participants, but that is really all. 
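The bar heights for this chart could come from `value_counts`, which counts how often each sender appears. A minimal sketch on a made-up Series of sender names:

```python
import pandas as pd

# Made-up sample of the 'sender_name' column
senders = pd.Series(
    ['Chad Larson', 'Scott Pence', 'Chad Larson', 'Joanna Rusch'],
    name='sender_name',
)

# Number of messages per sender, sorted from most to least active
counts = senders.value_counts()
print(counts)
```

With matplotlib installed, `counts.plot.bar()` would draw the chart directly from this result.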

I am also thinking of a visualization concept kind of like a scatter plot where each message in the DataFrame is represented by a dot that could be hovered over to reveal the message's text (from the 'content' variable) because I cannot think of an easier way to show that variable since many messages have long text. The dots would be organized across the x-axis by each participant's name, similar to the bar chart idea above, and color coded based on the 'type' variable that the user could filter (toggle certain types on and off). This would allow someone to explore each message individually by participant and message type. I do not know what the y-axis would be in this visualization. I don't think the 'timestamp_ms' variable makes sense for the y-axis because each message would have a different value (unless two messages were sent at the exact same millisecond, which is not impossible but is very unlikely) so I think that would make for a very busy y-axis or one where it is very hard to determine where each dot lines up with its y-value. I suppose there doesn't need to be a y-axis. This is a concept where I would want to explore different options to see what makes sense, looks good, and conveys the message I want.
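One option worth exploring for the y-axis is converting 'timestamp_ms' into a readable date, which might make a time axis less busy than raw millisecond values. This sketch assumes the values are Unix epoch milliseconds (the export does not state this) and uses the first two timestamps shown earlier.

```python
import pandas as pd

# First two 'timestamp_ms' values from the DataFrame preview
ts = pd.Series([1666374933946, 1666373448613], name='timestamp_ms')

# Interpret the integers as milliseconds since the Unix epoch (assumption)
sent_at = pd.to_datetime(ts, unit='ms')
print(sent_at)

# Coarser units, such as the calendar date, could group dots along an axis
print(sent_at.dt.date)
```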