# Introduction to Visualisations with Plotly
Welcome to my notebook. Today we will explore the fantastic plotly library, and I will show you different ways you can use it for data analysis.

<img src="https://th.bing.com/th/id/Re2b554968bfb407bbe4c97fcf72ca7ab?rik=YBT0Pg%2bsl%2f9Gug&riu=http%3a%2f%2fwww.liyanatech.com%2fwp-content%2fuploads%2f2014%2f11%2fdata_analysis.jpg&ehk=3MVoPtiVUhWyphOFkvPXrZ579l%2fZu4zXGpTbLI9hl%2bM%3d&risl=&pid=ImgRaw" width="500px"/>

In [None]:
import pandas as pd
import plotly.express as px
import plotly.graph_objs as go
from collections import Counter
from plotly.subplots import make_subplots
from sklearn.preprocessing import LabelEncoder
from plotly.figure_factory import create_2d_density as density

We will be using the Student Exams dataset. Here is a preview:

In [None]:
df = pd.read_csv('../input/students-performance-in-exams/StudentsPerformance.csv')
df.head()

### Description of features
* **gender** - "male" or "female"
* **race/ethnicity** - Groups A, B, C, D and E
* **parental level of education** - what type of education the student's parents have received: "associate's degree", "bachelor's degree", "high school", "master's degree", "some college" or "some high school"
* **lunch** - type of meals that students have at school: "standard" or "free/reduced"
* **test preparation course** - whether they have completed the preparation for the test: "completed" or "none"
* **math score** - numerical range of scores for maths tests
* **reading score** - numerical range of scores for reading tests
* **writing score** - numerical range of scores for writing tests

# Pie graphs
#### What is this chart?
* Pie charts display various categories within data by representing them as slices of a circle, like a pizza. They can show how much of each variable there is in your dataset in relation to the others.

#### When is it useful?
* Pie graphs are best used when the feature has a small number of unique categories, so that each slice is easy to see.

#### How do you use it?
* You create this chart with plotly express's 'pie' function. The first parameter is the dataframe, and the second is the name of the feature you wish to visualise.

## Gender

Here we plot the gender of the students, seeing that there are slightly more girls than boys, by almost 2 percent.

In [None]:
px.pie(df, 'gender')

## Lunch

The next chart shows us that a little over a third of the students receive free/reduced meals, while the rest have standard lunches.

In [None]:
px.pie(df, 'lunch')

## Test preparation course

The final pie plot shows us that, again, a little over a third of pupils have completed the test preparation course, while a little under two thirds have not.

In [None]:
px.pie(df, 'test preparation course')

# Bar charts

#### What is this chart?
* A bar chart displays categorical variables through rectangles which are next to each other.

#### When is it useful?
* Bar charts are practical for variables with multiple categories, as they can showcase them without overcrowding.

#### How do you use it?
* The function I use for this chart is plotly express's 'bar'. The neat thing about this is that the first parameter you enter is the dataframe you wish to work on, therefore all you need to do next is provide the names for two of the dataframe's features you want to plot.

## Race/ethnicity

At its most basic, we plot how often each category of the 'race/ethnicity' feature occurs. We first find the distribution of the feature using the 'Counter' class from collections, then collect its contents into a dataframe called 'data', and finally pass that dataframe to our plotting function.

In [None]:
count = Counter(df['race/ethnicity'])
data = pd.DataFrame({'Ethnicity group':count.keys(), 'Number of students':count.values()})
px.bar(data, 'Ethnicity group', 'Number of students')

To give a more informative view, we can now sort 'data' into descending order, starting with the most common category and making our way down to the least common one.

In [None]:
data = data.sort_values(by='Number of students', ascending=False)
px.bar(data, 'Ethnicity group', 'Number of students')

As a final touch, we can add colour to our chart by setting the 'color' parameter of the 'bar' function, so that the most frequent categories are yellow, the middle ones purple and the least frequent blue. We also add an 'Ethnicity group' title to make it even more informative.

In [None]:
data = data.sort_values(by='Number of students', ascending=False)
px.bar(data, 'Ethnicity group', 'Number of students', color='Number of students', title='Ethnicity group')

## Parental level of education

We can do another example of these.

Again, we find the distribution of the feature using 'Counter', then visualise it with a bar chart.

In [None]:
count = Counter(df['parental level of education'])
data = pd.DataFrame({'Parental level of education':count.keys(), 'Number of students':count.values()})
px.bar(data, 'Parental level of education', 'Number of students')

Furthermore, we sort the bars out into descending order.

In [None]:
data = data.sort_values(by='Number of students', ascending=False)
px.bar(data, 'Parental level of education', 'Number of students')

Finally, we colour-code the bars and add a title to the graph.

In [None]:
data = data.sort_values(by='Number of students', ascending=False)
px.bar(data, 'Parental level of education', 'Number of students', color='Number of students', title='Parental level of education')

# Scatter plots
#### What is this chart?
* Scatter plots are graphs that present two numerical features by representing each pair of values using a circle.

#### When is it useful?
* These plots are used for numerical variables, as they have a wide range of unique values which can be scattered. Categorical variables have fewer options, which makes their correlation harder to judge.

#### How do you use it?
* Just as with the bar charts, the scatter plots are used by specifying the first parameter as the dataframe you desire to work on, followed by the names of the two features in the dataset that you want to scatter. The function we use here is plotly express's 'scatter'.

Here we have a basic scatter plot. We have used our 'df' dataframe to scatter the maths and reading scores of our students.

In [None]:
px.scatter(df, 'math score', 'reading score')
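
To put a number on the relationship a scatter plot suggests, the Pearson correlation coefficient can be computed with pandas. This is a minimal sketch on hypothetical stand-in values, not the real dataset; with the notebook's data you would use the matching columns of `df`.

```python
import pandas as pd

# Hypothetical stand-in rows; the notebook itself uses the full `df`.
sample = pd.DataFrame({
    'math score':    [40, 55, 62, 70, 88],
    'reading score': [45, 60, 58, 75, 90],
})

# Pearson correlation: values near +1 mean the points lie close to a rising line,
# values near 0 mean there is no linear relationship.
r = sample['math score'].corr(sample['reading score'])
print(round(r, 3))
```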

## Gender

However, we can do something very interesting with this data. In addition to finding correlation, we can also check whether the variables help to separate the categories of another feature in the data. Here we visualise the distinction between genders, with boys in red and girls in blue. This is incredibly useful, as our takeaway from this plot is that boys tend to score higher than girls in maths, while girls tend to score higher than boys in reading.

In [None]:
px.scatter(df, 'math score', 'reading score', color='gender')

## Race/ethnicity

Next, we do the same thing with our 'race/ethnicity' feature. The chart here shows no clear difference between the various ethnicities in relation to their scores.

In [None]:
px.scatter(df, 'math score', 'reading score', color='race/ethnicity')

## Lunch

Now we try to distinguish the difference in scores between the students who have standard lunches and those who have free/reduced ones. Due to an imbalance in the data (more 'standard' lunch samples than 'free/reduced' ones), it is difficult to recognise a straightforward connection between lunch type and scores, so I cannot come to a conclusion from this plot. We will, however, see the difference more clearly further on in the notebook, using graphs better suited to these features.

In [None]:
px.scatter(df, 'math score', 'reading score', color='lunch')

## Test preparation course

Here we compare the students who have done the test preparation against those who have not. Our conclusions are that, while it is perfectly possible for a person to do well on the test without preparation, pupils who have completed the preparation perform on average at a higher level on their tests than those who have not.

In [None]:
px.scatter(df, 'math score', 'reading score', color='test preparation course')

## Parental level of education

Now we take a look at the education levels the students' parents have achieved. A takeaway from this could be that students whose parents have a college degree may have slightly higher exam scores than those whose parents have solely finished high school.

In [None]:
px.scatter(df, 'math score', 'reading score', color='parental level of education')

# Subplots
#### What is this chart?
* Subplots are charts which allow you to fit in multiple graphs beside each other.

#### When is it useful?
* These plots can be commonly used whenever you want to compare the results of different charts or you wish to summarise the conclusions of your previous graphs.

#### How do you use it?
* Subplots are created with the 'make_subplots' function from plotly.subplots, passing in the number of rows and columns, as well as the titles for the subplots if you wish. You can then add a graph to one of the cells using 'add_trace'. I create my charts for subplots through plotly graph_objs (abbreviated to 'go'), as it is quite difficult to use plotly express for this.

Firstly, we create subplots by defining two rows and two columns. The four feature names we will work with are also defined for the titles.

In [None]:
fig = make_subplots(rows=2, cols=2, subplot_titles=['gender', 'lunch', 'race/ethnicity', 'test preparation course'])

Here we can see the layout of our plots. We will analyse the connection between the reading and writing scores using scatter plots. The feature names appear above the graphs we will plot, and the overall title 'Reading and writing scores' appears at the top.

In [None]:
fig.update_layout(title_text='Reading and writing scores')

Next, we add our first chart. I have encoded 'gender' with a LabelEncoder, so that the categories are converted to numbers which can be used as marker colours. The result is a scatter plot like the ones we saw before, this time showing the reading and writing scores and their connection to gender.

In [None]:
le = LabelEncoder()
fig.add_trace(go.Scatter(x=df['reading score'], y=df['writing score'], mode='markers', 
                         marker=dict(color=le.fit_transform(df['gender']))), 1, 1)

To get a bit more familiar with how this works, we can also see the 'lunch' feature being added into the second plot.

In [None]:
fig.add_trace(go.Scatter(x=df['reading score'], y=df['writing score'], mode='markers', 
                         marker=dict(color=le.fit_transform(df['lunch']))), 1, 2)

Now that we understand the mechanics of subplots, we can assemble all four together. I'd say those are some rather satisfying subplots, wouldn't you agree?

In [None]:
fig.add_trace(go.Scatter(x=df['reading score'], y=df['writing score'], mode='markers', 
                         marker=dict(color=le.fit_transform(df['race/ethnicity']))), 2, 1)
fig.add_trace(go.Scatter(x=df['reading score'], y=df['writing score'], mode='markers', 
                         marker=dict(color=le.fit_transform(df['test preparation course']))), 2, 2)

# Treemap
#### What is this chart?
* Treemaps help us visualise data by displaying nested rectangles.

#### When is it useful?
* They can be used whenever you have categorical features and you wish to determine their quantity, or how they relate to a numerical feature.

#### How do you use it?
* You use this through plotly express's 'treemap' function: first specify the dataframe you are working on, then set the 'path' parameter to the list of features you wish to nest, outermost first. You can also use the 'color' parameter to show how these categories relate to a numerical feature.

On the most basic level, we can see the difference in reading scores between the two genders. Hovering over the rectangles and comparing their colours, we see that the girls score higher than the boys in reading.

In [None]:
fig = px.treemap(df, path=['gender'], color='reading score')
fig.show()

To make it a bit more complex, we can add the 'test preparation course' to be nested into our 'gender' rectangles. The data shows us that on average, those who perform the best at reading are girls who have finished their preparation course, while those who have done the worst are boys that have not done the course.

In [None]:
px.treemap(df, path=['gender', 'test preparation course'], color='reading score')
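
To check the numbers behind a treemap like this, the same grouping can be reproduced with pandas. As far as I understand, each rectangle's size reflects the number of rows in that group and its colour reflects the group's average of the `color` column, so a count and a mean per group should match what the hover text reports. The `sample` frame is a hypothetical stand-in for the real dataset.

```python
import pandas as pd

# Hypothetical stand-in rows; the notebook itself uses the full `df`.
sample = pd.DataFrame({
    'gender': ['female', 'female', 'female', 'male', 'male', 'male'],
    'test preparation course': ['completed', 'none', 'none', 'completed', 'none', 'none'],
    'reading score': [80, 70, 74, 72, 60, 64],
})

# One row per (gender, course) pair: how many students, and their mean reading score.
summary = (
    sample.groupby(['gender', 'test preparation course'])['reading score']
    .agg(students='count', mean_reading='mean')
    .reset_index()
)
print(summary)
```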

At the most complex level we will do here, we add a third layer to our visualisation: 'lunch'. The graph shows that those who have standard lunches generally perform better in reading than those with the free/reduced meals.

In [None]:
px.treemap(df, path=['gender', 'test preparation course', 'lunch'], color='reading score')

Let's do another example.

This time we analyse writing scores and the starting feature is 'lunch', which shows us (like the last chart) that students with standard lunches do better than those with free/reduced ones.

In [None]:
px.treemap(df, path=['lunch'], color='writing score')

The data is once again split further into binary variables, this time being the 'test preparation course'.

In [None]:
px.treemap(df, path=['lunch', 'test preparation course'], color='writing score')

Finally, we add in the 'parental level of education' feature as our final one. It tells us that the students with the worst writing scores have parents who only got a high school education, while the pupils with the best writing scores have parents who got college education. The higher the degree, the higher the marks.

In [None]:
px.treemap(df, path=['lunch', 'test preparation course', 'parental level of education'], color='writing score')

# Sunburst chart
#### What is this chart?
* The sunburst charts are similar to the treemaps, as they also visualise hierarchical data, using nested circles instead of rectangles.

#### When is it useful?
* As with the previous graph, the sunburst chart is used when displaying hierarchical data.

#### How do you use it?
* This graph is used by specifying the dataset and then specifying the 'path' attribute with the various nested features within the data.

Using one variable (test preparation course), the sunburst is basically a pie chart.

In [None]:
px.sunburst(df, path=['test preparation course'])

Things get interesting, though, when we use multiple features. The plot shows us that more students have not completed the preparation course than have, and that more pupils have standard rather than free/reduced lunches.

In [None]:
px.sunburst(df, path=['test preparation course', 'lunch'])
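
The counts that a two-level sunburst encodes can also be read off as a table. A minimal sketch using `pd.crosstab` on a hypothetical `sample` frame (not the real dataset): rows correspond to the inner ring, columns to the outer ring, and `margins=True` adds the totals.

```python
import pandas as pd

# Hypothetical stand-in rows; the notebook itself uses the full `df`.
sample = pd.DataFrame({
    'test preparation course': ['none', 'none', 'completed', 'none', 'completed'],
    'lunch': ['standard', 'free/reduced', 'standard', 'standard', 'standard'],
})

# Cross-tabulate the two features; the 'All' row and column hold the totals.
table = pd.crosstab(sample['test preparation course'], sample['lunch'], margins=True)
print(table)
```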

To make it even more interesting, we now add the 'gender' variable as the last entry of the 'path' attribute, with the final touch being a colour-coded display showing the average reading score of each segment.

In [None]:
px.sunburst(df, path=['test preparation course', 'lunch', 'gender'], color='reading score')

# Density plot
#### What is this chart?
* The purpose of a density plot is related to that of a scatter plot. You take in two numerical features and you scatter them. However, the difference is that a density plot shows the most populated areas of your graph using various lines and shades, sort of like a contour map. Along with that, we can also display the distribution of each feature using a histogram.

#### When is it useful?
* The density plot is useful whenever you want to know the distribution of features and the most common samples across a 2D scatter graph.

#### How do you use it?
* The graph is used by specifying two numerical columns in the 'create_2d_density' function.

Here we use a density plot to scatter the math and writing scores of the pupils. Our conclusions from this graph are that the most common writing scores are in the mid 70s, while the most common math scores are in the mid 60s. We can see this as we hover over the darkest blue portion of the graph.

In [None]:
density(df['math score'], df['writing score'])

We see similar results as we plot out the math and reading scores of the pupils.

In [None]:
density(df['math score'], df['reading score'])

Now we come across something very intriguing: the reading and writing scores are very interlinked. Analysing the chart below, we can see a much clearer and stronger correlation between these two variables than we saw in the rest. In particular, there seem to be three clusters in the data: one around 54.5, another around 64.5 and the strongest around 74.5 on both variables.

In [None]:
density(df['reading score'], df['writing score'])
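
Instead of hovering over the darkest region, the most populated area can be located numerically. The sketch below bins two score columns into a 2D histogram with NumPy and reports the fullest bin. The scores and the 10-point bin width are hypothetical choices for illustration; with the notebook's data you would pass the matching columns of `df`.

```python
import numpy as np

# Hypothetical stand-in scores; the notebook itself uses columns of `df`.
reading = np.array([52, 55, 64, 66, 73, 74, 75, 76, 78, 90])
writing = np.array([50, 57, 63, 65, 72, 74, 76, 75, 77, 92])

# Count points in 10-point-wide bins from 0 to 100 on both axes.
edges = np.arange(0, 101, 10)
counts, x_edges, y_edges = np.histogram2d(reading, writing, bins=[edges, edges])

# Locate the fullest bin and report the score range it covers on each axis.
i, j = np.unravel_index(np.argmax(counts), counts.shape)
busiest = ((x_edges[i], x_edges[i + 1]), (y_edges[j], y_edges[j + 1]))
print(busiest, int(counts[i, j]))
```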

In summary, plotly is an amazing tool for EDA which can provide us with a wide range of charts to help us solve various problems. I hope this notebook was helpful in showing you the incredible opportunities of plotly.

<img src="https://3.bp.blogspot.com/-Y1sdl8xLBPY/WVPm5-sUULI/AAAAAAAAaOw/PeZ3hGtQh_ogzEPyfcjys98VAJta-qc0QCEwYBhgL/s320/ZyAGpCRUBh-Q_RGK7SsecNNHylV0SHhVbmJ_AEkCtuo.jpg" width="450px"/>

### Thank you for reading my notebook.
### If you enjoyed this notebook, please upvote it and give feedback.

### To learn more about choropleth maps and time series data, you can check out my notebook:
https://www.kaggle.com/dabawse/introduction-to-choropleth-maps-and-time-series