# 1st Data working session. DWT
#### 28/08/2023 8:05-9pm

The data we come across in online courses and tutorials tends to come from other countries. In case that makes data science feel like something that is only relevant far away, we decided to explore our own data this time.

This dataset is formed from the Google form responses of people who signed up for the Data with Tabitha meetup in June 2023. The structure of the form can be found here: https://docs.google.com/forms/d/e/1FAIpQLSciK0pCAoKpFFe9X9GNHbykR3QuLjFZCnELB5V2JuYHjwlAjQ/viewform?usp=sf_link

In [1]:
import pandas as pd

In [2]:
df = pd.read_excel("Data with Tabitha (Responses).xlsx")

### de-duplicate
If the form was filled in more than once by the same person, drop the duplicate rows. The phone numbers can be used to check this.

In [3]:
df = df.drop_duplicates(subset=["Whatsapp number "])
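Before dropping anything, it can help to see how many repeat submissions there are. A minimal sketch on made-up data (the names and phone numbers below are hypothetical stand-ins, not values from the real responses):

```python
import pandas as pd

# Hypothetical toy data standing in for the form responses.
toy = pd.DataFrame({
    "Name": ["A", "B", "A again"],
    "Whatsapp number ": ["0700000001", "0700000002", "0700000001"],
})

# True for every row whose phone number already appeared in an earlier row.
is_repeat = toy.duplicated(subset=["Whatsapp number "])
n_repeats = int(is_repeat.sum())

# keep="first" (the default) retains the earliest submission per number.
deduped = toy.drop_duplicates(subset=["Whatsapp number "], keep="first")

print(n_repeats, len(deduped))
```

Note that `drop_duplicates` keeps the original index labels, so the labels of the remaining rows can skip numbers afterwards.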

### pseudo-anonymising the data

Find out what columns are in the data and remove the columns that contain information that can be used to identify a person (personally identifiable information).


In [4]:
df.columns

Index(['Timestamp', 'Name', 'Whatsapp number ', 'Email address ',
       'About you. Tick the boxes that are true for you.'],
      dtype='object')

In [5]:
df = df.drop(columns=["Name", "Email address ", "Whatsapp number "])

### see what sort of data values make up the dataset.

#### summary statistics
`df.describe()` works really well with numerical data. For each numeric column it reports the count, mean, standard deviation, minimum, quartiles (including the median) and maximum.

However, we see that it struggles to produce useful insight about our dataset, which is mainly made up of text.

In [6]:
df.describe()



| | Timestamp | About you. Tick the boxes that are true for you. |
|---|---|---|
| count | 32 | 32 |
| unique | 32 | 16 |
| top | 2023-06-25 07:44:03.363000 | Can code in one programming language, Can code... |
| freq | 1 | 6 |
| first | 2023-06-25 07:44:03.363000 | |
| last | 2023-06-29 05:05:58.522000 | |
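To make the contrast concrete, here is a small sketch of what `describe()` returns for a numeric column versus a text column. The `age` and `answer` columns are hypothetical and are not part of the meetup data.

```python
import pandas as pd

# Hypothetical toy data: one numeric column and one text column.
toy = pd.DataFrame({
    "age": [21, 25, 30, 25],
    "answer": ["yes", "no", "yes", "yes"],
})

# On a mixed DataFrame, describe() summarises only the numeric columns:
# count, mean, std, min, 25%, 50%, 75%, max.
numeric_summary = toy.describe()

# On a text column it falls back to count, unique, top and freq.
text_summary = toy["answer"].describe()

print(numeric_summary)
print(text_summary)
```

For text, the most that `describe()` can say is how many distinct values there are and which one is most frequent, which is why it tells us little about a column of free-form, comma-separated answers.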


#### seeing sample rows of dataset
We can choose to see the first or last few rows of the dataset so that we know what kind of data each column has.

In [7]:
df.head()

| | Timestamp | About you. Tick the boxes that are true for you. |
|---|---|---|
| 0 | 2023-06-25 07:44:03.363 | Can code in one programming language, Can code... |
| 1 | 2023-06-25 07:54:24.820 | Can code in one programming language, Can code... |
| 2 | 2023-06-25 07:55:56.577 | Can code in one programming language, Can code... |
| 4 | 2023-06-25 09:59:42.652 | Can code in one programming language, Can code... |
| 5 | 2023-06-25 10:10:50.651 | Can code in one programming language, Can code... |


In [8]:
df.tail()

| | Timestamp | About you. Tick the boxes that are true for you. |
|---|---|---|
| 32 | 2023-06-26 15:07:49.325 | Can code in Python or R, Studying (or studied)... |
| 34 | 2023-06-26 16:14:43.350 | Can code in one programming language, Can code... |
| 35 | 2023-06-26 16:46:50.196 | Can code in one programming language, Can code... |
| 36 | 2023-06-26 17:37:54.235 | Studying (or studied) a Data science related t... |
| 37 | 2023-06-29 05:05:58.522 | Studying (or studied) a Data science related t... |


## Finding insights

We can explore the "About you" column to answer questions such as:
- How many of the people who signed up for the meetup could code in at least one programming language?

### Exploring the "About you" column
We start by looking at what an entry in this column looks like. It seems to consist of the text of all the options that the participant ticked on the last question of the Google Form. The options are combined into a single string and separated from each other by commas.

In [9]:
# .iloc[0] gets the content of the first row of that column by position,
# which keeps working even when the index labels have gaps after de-duplication.
df["About you. Tick the boxes that are true for you."].iloc[0]

'Can code in one programming language, Can code in Python or R, Have built a Data science related project, Uses Data science concepts at work.'

In [10]:
# a shorter name will make our work much easier.
df = df.rename(columns = {"About you. Tick the boxes that are true for you.": "responses"})

In [11]:
# check that the rename worked.
df.columns

Index(['Timestamp', 'responses'], dtype='object')

The next instinct might be to print out unique values of our column of interest to see what categories exist. However, that doesn't seem to take us closer to knowing how many people knew how to code.

In [12]:
df['responses'].unique()

array(['Can code in one programming language, Can code in Python or R, Have built a Data science related project, Uses Data science concepts at work.',
       'Can code in one programming language, Can code in Python or R, Studying (or studied) a Data science related thing, Have built a Data science related project',
       'Can code in one programming language, Can code in Python or R, Studying (or studied) a Data science related thing',
       'Can code in one programming language, Can code in Python or R',
       'Can code in one programming language',
       'Can code in Python or R, Studying (or studied) a Data science related thing',
       'Studying (or studied) a Data science related thing',
       'Uses Data science concepts at work.',
       'Can code in one programming language, DS enthusiast ',
       'Can code in one programming language, ',
       'Can code in Python or R',
       'Can code in one programming language, Studying (or studied) a Data science related thing',


So we split the long strings into their component sub-responses, so that we can count how many people chose each one.
`df.explode()` will help us get one sub-response per row. It acts on a column of lists, not strings, so we first split each string into a list of sub-responses and then explode on that column.

In [13]:
df["resp_list"] = df["responses"].str.split(",")

In [14]:
df.head()

| | Timestamp | responses | resp_list |
|---|---|---|---|
| 0 | 2023-06-25 07:44:03.363 | Can code in one programming language, Can code... | [Can code in one programming language, Can co... |
| 1 | 2023-06-25 07:54:24.820 | Can code in one programming language, Can code... | [Can code in one programming language, Can co... |
| 2 | 2023-06-25 07:55:56.577 | Can code in one programming language, Can code... | [Can code in one programming language, Can co... |
| 4 | 2023-06-25 09:59:42.652 | Can code in one programming language, Can code... | [Can code in one programming language, Can co... |
| 5 | 2023-06-25 10:10:50.651 | Can code in one programming language, Can code... | [Can code in one programming language, Can co... |


In [15]:
df = df.explode("resp_list")

In [16]:
df['resp_list'] = df['resp_list'].str.strip()

In [17]:
df['resp_list'].value_counts()

Can code in one programming language                              22
Can code in Python or R                                           21
Studying (or studied) a Data science related thing                20
Have built a Data science related project                          8
Uses Data science concepts at work.                                3
DS enthusiast                                                      1
                                                                   1
Curious about data science                                         1
Software Engineer | DevOps | IT support specialist with Python     1
Name: resp_list, dtype: int64

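Now we know that 22 of the 32 participants who signed up to the meet-and-greet ticked "Can code in one programming language". Some respondents ticked "Can code in Python or R" without ticking that option, so the number who can code in at least one language is probably higher than 22. The blank entry in the counts comes from a response that ended with a trailing comma.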
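The question at the start of this section asked about people who can code in at least one language. Below is a minimal sketch of one way to count them per person. It assumes that ticking either "Can code in one programming language" or "Can code in Python or R" means the person can code; that reading is an assumption, and the toy responses are made up.

```python
import pandas as pd

# Hypothetical responses, one string per person, shaped like the form output.
responses = pd.Series([
    "Can code in one programming language, Can code in Python or R",
    "Can code in Python or R, Studying (or studied) a Data science related thing",
    "Studying (or studied) a Data science related thing",
    "Can code in one programming language, ",
])

# Turn each string into a list of clean options, dropping empty leftovers
# such as the one produced by a trailing comma.
options = responses.apply(
    lambda text: [part.strip() for part in text.split(",") if part.strip()]
)

# Assumption: either of these options counts as "can code".
coding_options = {"Can code in one programming language", "Can code in Python or R"}

# True for each person who ticked at least one of the coding options.
can_code = options.apply(lambda ticked: bool(coding_options & set(ticked)))

print(int(can_code.sum()))
```

Because this works on one row per person, nobody is counted twice for ticking both options.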

## Further work
We could go on to calculate the percentages of people that chose each subcategory, with respect to the total number of people who signed up.
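A rough sketch of that calculation, using made-up responses. The important detail is to divide by the number of people rather than by the number of exploded rows, which is why `value_counts(normalize=True)` is not used here.

```python
import pandas as pd

# Hypothetical responses, one string per person.
responses = pd.Series([
    "Can code in one programming language, Can code in Python or R",
    "Can code in Python or R",
    "Can code in one programming language",
    "Studying (or studied) a Data science related thing",
], name="responses")

# Count people before exploding: one row per person at this point.
n_people = len(responses)

# One option per row, whitespace removed, then counted.
counts = (
    responses.str.split(",")
    .explode()
    .str.strip()
    .value_counts()
)

# Share of people who ticked each option, as a percentage.
percentages = (counts / n_people * 100).round(1)
print(percentages)
```

The percentages can add up to more than 100 because each person can tick several options.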

Also, we could look at the Timestamp column and observe if there are any patterns in the intervals between when the form was filled in by the various participants.
- Did the majority fill in the form within the first hour?
- What is the average time between responses? (Keep in mind that responses are independent of each other.)
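One possible starting point for the timestamp questions, sketched on hypothetical timestamps (these values are invented for illustration, not taken from the responses):

```python
import pandas as pd

# Hypothetical submission times.
stamps = pd.Series(pd.to_datetime([
    "2023-06-25 07:44:00",
    "2023-06-25 07:54:00",
    "2023-06-25 09:59:00",
    "2023-06-26 15:07:00",
]))

# Sort first so that each difference is the gap to the previous submission.
stamps = stamps.sort_values()
gaps = stamps.diff().dropna()

mean_gap = gaps.mean()
median_gap = gaps.median()

# Share of submissions made within one hour of the first one.
first_hour = stamps <= stamps.iloc[0] + pd.Timedelta(hours=1)
share_first_hour = first_hour.mean()

print(mean_gap, median_gap, share_first_hour)
```

The median is often more informative than the mean here, since one late submission can stretch the average gap considerably.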

Moreover, we could explore whether there are more efficient ways of getting the data we need out of the responses column without exploding the dataset. This is because exploding significantly increases the number of rows in the data.
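One candidate worth trying is `Series.str.get_dummies`, which builds one indicator column per option and keeps one row per person. This is only a sketch on made-up responses; it assumes the options are cleanly separated by a comma followed by a space, so stray whitespace or trailing commas in the real data would need cleaning first.

```python
import pandas as pd

# Hypothetical responses, one string per person.
responses = pd.Series([
    "Can code in one programming language, Can code in Python or R",
    "Can code in Python or R",
    "Can code in one programming language",
])

# One column per distinct option; 1 where the person ticked it, 0 otherwise.
indicators = responses.str.get_dummies(sep=", ")

# Summing each column gives the number of people who ticked that option.
counts = indicators.sum()
print(counts)
```

Whether this is actually lighter than exploding depends on how many distinct options there are, since every option becomes its own column.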

That's it for now. Good night! :)