# Creating Data Visualisations Using Python

Creating data visualisations in Python typically involves three key stages:

1. Loading the dataset
2. Cleaning and shaping the relevant data
3. Visually presenting the insights

> __Note__: We'll explore each of these stages using the [Kickstarter Projects dataset](https://www.kaggle.com/datasets/kemical/kickstarter-projects?select=ks-projects-201801.csv).

## 1. Loading the Dataset

### What happens at this stage?

This is where we import our data into our Python environment.

### What are some common data sources?

We can load data stored in common file formats such as `.csv`, `.xlsx`, and `.json`. Data can also be loaded directly from databases.

### Which Python libraries can be used for this task?

- [pandas](https://pandas.pydata.org/docs/) – Used to read data from various file formats and return a DataFrame.
- [openpyxl](https://openpyxl.readthedocs.io/en/stable/#documentation) or [xlrd](https://xlrd.readthedocs.io/en/latest/) – Excel engines that pandas relies on: openpyxl reads `.xlsx` files, while xlrd reads the older `.xls` format.
- [sqlalchemy](https://docs.sqlalchemy.org/en/20/) – A SQL toolkit and Object Relational Mapper (ORM) for Python that connects to SQL databases so that query results can be loaded into a DataFrame.
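
As a rough illustration of the sources listed above, the sketch below loads the same small table from CSV text, JSON text, and a SQL query. The data, table name, and column names are hypothetical, and an in-memory SQLite database stands in for a real database server. For other databases you would typically pass a SQLAlchemy engine to pandas instead of the `sqlite3` connection.

```python
import io
import sqlite3

import pandas as pd

# Hypothetical in-memory data standing in for real files and databases.
csv_text = "name,goal\nProject A,1000\nProject B,2500\n"
json_text = '[{"name": "Project A", "goal": 1000}, {"name": "Project B", "goal": 2500}]'

# CSV: read_csv accepts a file path, a URL, or a file-like object.
df_csv = pd.read_csv(io.StringIO(csv_text))

# JSON: a list of records becomes one row per record.
df_json = pd.read_json(io.StringIO(json_text))

# SQL: query an in-memory SQLite database and get a DataFrame back.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE projects (name TEXT, goal INTEGER)")
conn.executemany(
    "INSERT INTO projects VALUES (?, ?)",
    [("Project A", 1000), ("Project B", 2500)],
)
df_sql = pd.read_sql_query("SELECT name, goal FROM projects", conn)
conn.close()

print(df_csv.shape, df_json.shape, df_sql.shape)
```

Excel files follow the same pattern with `pd.read_excel`, provided the matching engine (openpyxl or xlrd) is installed.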

### Can you show us an example?

    # Import the pandas library
    # (documentation) https://www.w3schools.com/python/python_modules.asp

    import pandas as pd

    # Load kickstarter-projects.csv into memory, outputting a pandas DataFrame
    # (documentation) https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html

    df = pd.read_csv('../data/kickstarter-projects.csv')

    # Display the DataFrame as a sanity check; the notebook truncates long output
    # to the first and last rows (use df.head() to show only the first five rows)
    # (documentation) https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html

    df

| Unnamed: 0 | ID | name | category | main_category | currency | deadline | goal | launched | pledged | state | backers | country | usd pledged | usd_pledged_real | usd_goal_real |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000002330 | The Songs of Adelaide & Abullah | Poetry | Publishing | GBP | 2015-10-09 | 1000.0 | 2015-08-11 12:12:28 | 0.0 | failed | 0 | GB | 0.0 | 0.0 | 1533.95 |
| 1 | 1000003930 | Greeting From Earth: ZGAC Arts Capsule For ET | Narrative Film | Film & Video | USD | 2017-11-01 | 30000.0 | 2017-09-02 04:43:57 | 2421.0 | failed | 15 | US | 100.0 | 2421.0 | 30000.00 |
| 2 | 1000004038 | Where is Hank? | Narrative Film | Film & Video | USD | 2013-02-26 | 45000.0 | 2013-01-12 00:20:50 | 220.0 | failed | 3 | US | 220.0 | 220.0 | 45000.00 |
| 3 | 1000007540 | ToshiCapital Rekordz Needs Help to Complete Album | Music | Music | USD | 2012-04-16 | 5000.0 | 2012-03-17 03:24:11 | 1.0 | failed | 1 | US | 1.0 | 1.0 | 5000.00 |
| 4 | 1000011046 | Community Film Project: The Art of Neighborhoo... | Film & Video | Film & Video | USD | 2015-08-29 | 19500.0 | 2015-07-04 08:35:03 | 1283.0 | canceled | 14 | US | 1283.0 | 1283.0 | 19500.00 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 378656 | 999976400 | ChknTruk Nationwide Charity Drive 2014 (Canceled) | Documentary | Film & Video | USD | 2014-10-17 | 50000.0 | 2014-09-17 02:35:30 | 25.0 | canceled | 1 | US | 25.0 | 25.0 | 50000.00 |
| 378657 | 999977640 | The Tribe | Narrative Film | Film & Video | USD | 2011-07-19 | 1500.0 | 2011-06-22 03:35:14 | 155.0 | failed | 5 | US | 155.0 | 155.0 | 1500.00 |
| 378658 | 999986353 | Walls of Remedy- New lesbian Romantic Comedy f... | Narrative Film | Film & Video | USD | 2010-08-16 | 15000.0 | 2010-07-01 19:40:30 | 20.0 | failed | 1 | US | 20.0 | 20.0 | 15000.00 |
| 378659 | 999987933 | BioDefense Education Kit | Technology | Technology | USD | 2016-02-13 | 15000.0 | 2016-01-13 18:13:53 | 200.0 | failed | 6 | US | 200.0 | 200.0 | 15000.00 |
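
Displaying the whole DataFrame is only one kind of sanity check. The sketch below shows a few other quick checks that are often useful right after loading. To stay self-contained, it rebuilds a tiny sample from three of the rows shown above, reduced to a handful of columns; the variable name `sample` is an assumption made for illustration.

```python
import io

import pandas as pd

# Three rows copied from the output above, reduced to a few columns.
csv_text = """ID,name,main_category,deadline,goal,state
1000002330,The Songs of Adelaide & Abullah,Publishing,2015-10-09,1000.0,failed
1000003930,Greeting From Earth: ZGAC Arts Capsule For ET,Film & Video,2017-11-01,30000.0,failed
1000004038,Where is Hank?,Film & Video,2013-02-26,45000.0,failed
"""
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.head(2))       # first two rows only
print(sample.shape)         # (number of rows, number of columns)
print(sample.dtypes)        # column types as pandas inferred them
print(sample.isna().sum())  # count of missing values per column
```

Note that `deadline` is read as text here; dates are only parsed if you ask for it, for example with the `parse_dates` argument of `read_csv` or with `pd.to_datetime` afterwards.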


## 2. Cleaning and Shaping the Dataset

### What happens at this stage?

Here, we identify and extract data relevant to the insights we want to explore. We then clean and structure the data to ensure it is accurate, complete, and ready for analysis or visualisation. This process can include:

- Handling missing values (`df.dropna()`, `df.fillna()`)
- Renaming or selecting specific columns
- Changing data types (e.g., `df['date'] = pd.to_datetime(df['date'])`)
- Filtering or grouping data
- Merging or joining datasets
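
The steps above can be sketched on a small, made-up table. Every value, column name, and the `countries` lookup table below is hypothetical; the point is only to show one plausible call for each step, not a recommended cleaning recipe.

```python
import pandas as pd

# Hypothetical raw data with typical problems: a missing value,
# dates stored as text, and an awkward column name.
raw = pd.DataFrame({
    "Project Name": ["A", "B", "C", "D"],
    "date": ["2015-08-11", "2017-09-02", "2013-01-12", "2012-03-17"],
    "pledged": [0.0, 2421.0, None, 1.0],
    "country": ["GB", "US", "US", "US"],
})

# Rename a column, then select the columns we need (copy to avoid modifying a view).
clean = raw.rename(columns={"Project Name": "name"})
clean = clean[["name", "date", "pledged", "country"]].copy()

# Handle missing values: fill with 0 here; dropna() would drop the row instead.
clean["pledged"] = clean["pledged"].fillna(0.0)

# Change data types: parse the text dates into datetimes.
clean["date"] = pd.to_datetime(clean["date"])

# Filter rows, then group and aggregate.
recent = clean[clean["date"] >= pd.Timestamp("2013-01-01")]
pledged_by_country = recent.groupby("country")["pledged"].sum()

# Merge with a second (hypothetical) lookup table.
countries = pd.DataFrame({
    "country": ["GB", "US"],
    "country_name": ["United Kingdom", "United States"],
})
merged = clean.merge(countries, on="country", how="left")

print(pledged_by_country)
print(merged)
```

Whether to fill or drop missing values depends on what the column means; filling `pledged` with 0 is only reasonable if a missing value really indicates that nothing was pledged.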

### What tools and libraries are commonly used?

- [pandas](https://pandas.pydata.org/docs/) – For powerful data manipulation and transformation
- [numpy](https://numpy.org/doc/stable/) – For numerical operations and array-based computations

### Can you show us an example?

    # First, let's explore the dataset and decide what insights we would like to focus on:
    #  - Which projects are least/most successful in terms of achieving their funding goals?
    #    - By country, by main category, by category?
    #  - Which projects are least/most successful in terms of overall project success?
    #    - By country, by main category, by category?
    #  - Which projects attract the most funding?
    #  - Which projects experience the least/most blockers?
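
To make a few of these questions concrete, here is a minimal sketch of how they might be answered by main category. It reuses column names from the dataset (`main_category`, `state`, `usd_pledged_real`, `usd_goal_real`), but the rows are made up for illustration, and the derived columns `funding_ratio` and `is_successful` are assumptions rather than part of the original analysis.

```python
import pandas as pd

# Hypothetical sample that reuses the dataset's column names; values are made up.
sample = pd.DataFrame({
    "main_category": ["Music", "Music", "Film & Video", "Film & Video",
                      "Technology", "Technology"],
    "state": ["successful", "failed", "failed", "canceled",
              "successful", "successful"],
    "usd_pledged_real": [5200.0, 1.0, 2421.0, 1283.0, 30000.0, 800.0],
    "usd_goal_real": [5000.0, 5000.0, 30000.0, 19500.0, 15000.0, 500.0],
})

# Funding ratio: pledged amount as a fraction of the goal (1.0 means fully funded).
sample["funding_ratio"] = sample["usd_pledged_real"] / sample["usd_goal_real"]

# Flag projects whose final state is "successful".
sample["is_successful"] = sample["state"] == "successful"

# One row per main category, sorted from most to least successful.
summary = (
    sample.groupby("main_category")
    .agg(
        projects=("state", "count"),
        success_rate=("is_successful", "mean"),
        median_funding_ratio=("funding_ratio", "median"),
        total_pledged_usd=("usd_pledged_real", "sum"),
    )
    .sort_values("success_rate", ascending=False)
)

print(summary)
```

Swapping `main_category` for `country` or `category` in the `groupby` call gives the other breakdowns mentioned in the list.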

## 3. Visually Presenting the Insights

### What happens at this stage?

This is where we use our cleaned and prepared data to create visual representations, such as charts and graphs, that communicate the insights effectively.

### Which libraries are commonly used for data visualisation?

- [matplotlib](https://matplotlib.org/stable/index.html) – A foundational plotting library that provides fine-grained control, though it can be more verbose.
- [seaborn](https://seaborn.pydata.org/tutorial.html) – Built on top of matplotlib, it offers a higher-level interface for creating statistical graphics.
- [plotly](https://plotly.com/python/) – Allows creation of rich interactive and web-based visualisations.
- [pandas](https://pandas.pydata.org/docs/) – Offers built-in plotting via `.plot()`, which uses matplotlib by default.
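
As a minimal sketch of this stage, the example below draws a horizontal bar chart with matplotlib. The success rates are illustrative placeholders, not results from the Kickstarter dataset, and the labels and figure size are arbitrary choices.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend for scripts; drop this line in a notebook

import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical success rates per main category (placeholder values).
success_rate = pd.Series(
    {"Music": 0.5, "Film & Video": 0.4, "Technology": 0.2},
    name="success_rate",
).sort_values()

# One horizontal bar per category, sorted so the highest rate is at the top.
fig, ax = plt.subplots(figsize=(6, 3))
ax.barh(list(success_rate.index), list(success_rate.values))
ax.set_xlabel("Share of successful projects")
ax.set_title("Success rate by main category (illustrative data)")
ax.set_xlim(0, 1)
fig.tight_layout()

# In a notebook the figure renders inline; in a script, call
# fig.savefig("chart.png") or switch to an interactive backend and use plt.show().
```

The same chart can be produced more tersely with `success_rate.plot.barh()`; the explicit matplotlib calls are shown here because they make each element of the chart visible.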