# Data Wrangling With Python 
(using Pandas)

**About Pandas**: pandas aims to be the fundamental high-level building block for doing practical, real world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis / manipulation tool available in any language.

**Highlights:**

* A fast and efficient **DataFrame** object for data manipulation with integrated indexing;

* Tools for **reading and writing data** between in-memory data structures and different formats: CSV and text files, Microsoft Excel, SQL databases, and the fast HDF5 format;

* Intelligent **data alignment and integrated handling of missing data**: gain automatic label-based alignment in computations and easily manipulate messy data into an orderly form;

* Flexible **reshaping and pivoting** of data sets;

* Intelligent label-based **slicing**, fancy indexing, and subsetting of large data sets;

* **Columns can be inserted and deleted** from data structures for size mutability;

* Aggregating or transforming data with a powerful **group by** engine allowing split-apply-combine operations on data sets;

* High performance **merging and joining** of data sets;

* Hierarchical axis indexing provides an intuitive way of working with **high-dimensional** data in a lower-dimensional data structure;

* **Time series-functionality**: date range generation and frequency conversion, moving window statistics, date shifting and lagging. Even create domain-specific time offsets and join time series without losing data;

Python with pandas is in use in a wide variety of academic and commercial domains, including Finance, Neuroscience, Economics, Statistics, Advertising, Web Analytics, and more.

- Source: https://pandas.pydata.org/about/
- Official documentation: https://pandas.pydata.org/docs/
- Further Reading: [Effective Pandas, Matt Harrison](https://www.amazon.com/Effective-Pandas-Patterns-Manipulation-Treading/dp/B09MYXXSFM?ref_=ast_author_dp)



## Load pandas

In [1]:
import pandas as pd

##Loading data

Download some sample data to our local filesystem (Linux, Mac)

In [2]:
!curl https://ucarecdn.com/8d8cd2ee-47d4-474f-b3a7-66eb9a20b43e/retail_data_clean.csv --output retail_data_clean.csv

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 4806k  100 4806k    0     0  15.5M      0 --:--:-- --:--:-- --:--:-- 15.5M


### Load Data from CSV

Doc: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html



Load a CSV file from your local hard drive:

In [3]:
pd.read_csv("retail_data_clean.csv")

Unnamed: 0,CustomerID,InvoiceNo,InvoiceDate,StockCode,Quantity,UnitPrice,Revenue
0,13047,536367,2010-12-01 08:34:00,84879,32,1.69,54.08
1,13047,536367,2010-12-01 08:34:00,22745,6,2.10,12.60
2,13047,536367,2010-12-01 08:34:00,22748,6,2.10,12.60
3,13047,536367,2010-12-01 08:34:00,22749,8,3.75,30.00
4,13047,536367,2010-12-01 08:34:00,22310,6,1.65,9.90
...,...,...,...,...,...,...,...
91145,17581,581581,2011-12-09 12:20:00,23562,6,2.89,17.34
91146,17581,581581,2011-12-09 12:20:00,23561,6,2.89,17.34
91147,17581,581581,2011-12-09 12:20:00,23681,10,1.65,16.50
91148,17581,581582,2011-12-09 12:21:00,23552,6,2.08,12.48


In [4]:
# Custom attributes
pd.read_csv("retail_data_clean.csv", 
            sep = ",", 
            header = 0, 
            usecols = ['CustomerID', 'InvoiceNo', 'Quantity'],
            nrows = 10,
            encoding = 'utf-8') # https://docs.python.org/3/library/codecs.html#standard-encodings

Unnamed: 0,CustomerID,InvoiceNo,Quantity
0,13047,536367,32
1,13047,536367,6
2,13047,536367,6
3,13047,536367,8
4,13047,536367,6
5,13047,536367,6
6,13047,536367,3
7,13047,536367,2
8,13047,536367,3
9,13047,536367,3


Load a CSV file from a URL

In [5]:
pd.read_csv("https://ucarecdn.com/8d8cd2ee-47d4-474f-b3a7-66eb9a20b43e/retail_data_clean.csv", 
            usecols = ['CustomerID', 'InvoiceNo', 'Quantity'],
            nrows = 10) 

Unnamed: 0,CustomerID,InvoiceNo,Quantity
0,13047,536367,32
1,13047,536367,6
2,13047,536367,6
3,13047,536367,8
4,13047,536367,6
5,13047,536367,6
6,13047,536367,3
7,13047,536367,2
8,13047,536367,3
9,13047,536367,3


Read a CSV file from a URL and store it in a Pandas DataFrame called "df".

In [6]:
df = pd.read_csv("https://ucarecdn.com/8d8cd2ee-47d4-474f-b3a7-66eb9a20b43e/retail_data_clean.csv")
df

Unnamed: 0,CustomerID,InvoiceNo,InvoiceDate,StockCode,Quantity,UnitPrice,Revenue
0,13047,536367,2010-12-01 08:34:00,84879,32,1.69,54.08
1,13047,536367,2010-12-01 08:34:00,22745,6,2.10,12.60
2,13047,536367,2010-12-01 08:34:00,22748,6,2.10,12.60
3,13047,536367,2010-12-01 08:34:00,22749,8,3.75,30.00
4,13047,536367,2010-12-01 08:34:00,22310,6,1.65,9.90
...,...,...,...,...,...,...,...
91145,17581,581581,2011-12-09 12:20:00,23562,6,2.89,17.34
91146,17581,581581,2011-12-09 12:20:00,23561,6,2.89,17.34
91147,17581,581581,2011-12-09 12:20:00,23681,10,1.65,16.50
91148,17581,581582,2011-12-09 12:21:00,23552,6,2.08,12.48


### Load Data from Excel

Doc: https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html

Read the first 10 rows of an Excel file located at a given URL and store the data in a Pandas DataFrame

In [7]:
file_path = "https://ucarecdn.com/82a291a5-5617-4ca7-aba3-54e1707785c3/retail_data_s.xlsx"
pd.read_excel(file_path, 
              sheet_name = 0, 
              nrows = 10)

Unnamed: 0,CustomerID,InvoiceNo,InvoiceDate,StockCode,Quantity,UnitPrice
0,13047,536367,2010-12-01 08:34:00,84879,32,1.69
1,13047,536367,2010-12-01 08:34:00,22745,6,2.1
2,13047,536367,2010-12-01 08:34:00,22748,6,2.1
3,13047,536367,2010-12-01 08:34:00,22749,8,3.75
4,13047,536367,2010-12-01 08:34:00,22310,6,1.65
5,13047,536367,2010-12-01 08:34:00,84969,6,4.25
6,13047,536367,2010-12-01 08:34:00,22623,3,4.95
7,13047,536367,2010-12-01 08:34:00,22622,2,9.95
8,13047,536367,2010-12-01 08:34:00,21754,3,5.95
9,13047,536367,2010-12-01 08:34:00,21755,3,5.95


### Load Data from SQL

Doc: https://pandas.pydata.org/docs/reference/api/pandas.read_sql.html

In [8]:
# For testing purposes simulate a SQL server from memory
from sqlite3 import connect
db_connection = connect(':memory:')
df.to_sql('transactions_table', db_connection)

91150

Read data with SQL from a table using a database connection.

In [9]:
pd.read_sql(sql = 'SELECT * FROM transactions_table LIMIT 5', 
            con = db_connection) # https://docs.sqlalchemy.org/en/13/core/connections.html

Unnamed: 0,index,CustomerID,InvoiceNo,InvoiceDate,StockCode,Quantity,UnitPrice,Revenue
0,0,13047,536367,2010-12-01 08:34:00,84879,32,1.69,54.08
1,1,13047,536367,2010-12-01 08:34:00,22745,6,2.1,12.6
2,2,13047,536367,2010-12-01 08:34:00,22748,6,2.1,12.6
3,3,13047,536367,2010-12-01 08:34:00,22749,8,3.75,30.0
4,4,13047,536367,2010-12-01 08:34:00,22310,6,1.65,9.9


Define a more complex query

In [10]:
query = '''
SELECT CustomerID
,InvoiceNo
,SUM(Quantity) as Total_Quantity
FROM transactions_table
GROUP BY CustomerID, InvoiceNo
ORDER BY Total_Quantity DESC
'''

Pass the query as a variable:

In [11]:
pd.read_sql(sql = query, con = db_connection)

Unnamed: 0,CustomerID,InvoiceNo,Total_Quantity
0,17450,567423,12572
1,18251,566595,7824
2,17450,567381,6760
3,14156,541220,6198
4,14298,571653,5918
...,...,...,...
4306,17135,578232,1
4307,17230,539645,1
4308,17581,552648,1
4309,17817,545900,1


### Load Data from more sources

https://pandas.pydata.org/docs/reference/io.html


## Data Wrangling

### Cleaning Data

#### List data types

Display the data type of each column in a Pandas DataFrame.

In [12]:
df.dtypes

CustomerID       int64
InvoiceNo        int64
InvoiceDate     object
StockCode       object
Quantity         int64
UnitPrice      float64
Revenue        float64
dtype: object

#### Numeric to string

https://pandas.pydata.org/docs/reference/api/pandas.Series.astype.html

Convert a column from any other data type to a string data type:

In [13]:
df['CustomerID'] = df['CustomerID'].astype(str)

In [14]:
df.dtypes

CustomerID      object
InvoiceNo        int64
InvoiceDate     object
StockCode       object
Quantity         int64
UnitPrice      float64
Revenue        float64
dtype: object

#### String to numeric
https://pandas.pydata.org/docs/reference/api/pandas.Series.astype.html

Convert a column in a dataframe to integer data type (whole numbers).

In [15]:
# to integer (whole numbers)
df['CustomerID'].astype(int)

0        13047
1        13047
2        13047
3        13047
4        13047
         ...  
91145    17581
91146    17581
91147    17581
91148    17581
91149    17581
Name: CustomerID, Length: 91150, dtype: int64

Convert a column in a dataframe to float data type (decimal numbers).

In [16]:
df['CustomerID'].astype(float)

0        13047.0
1        13047.0
2        13047.0
3        13047.0
4        13047.0
          ...   
91145    17581.0
91146    17581.0
91147    17581.0
91148    17581.0
91149    17581.0
Name: CustomerID, Length: 91150, dtype: float64

#### String to date
https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html


In [17]:
df['InvoiceDate']

0        2010-12-01 08:34:00
1        2010-12-01 08:34:00
2        2010-12-01 08:34:00
3        2010-12-01 08:34:00
4        2010-12-01 08:34:00
                ...         
91145    2011-12-09 12:20:00
91146    2011-12-09 12:20:00
91147    2011-12-09 12:20:00
91148    2011-12-09 12:21:00
91149    2011-12-09 12:21:00
Name: InvoiceDate, Length: 91150, dtype: object

Convert a column in a dataframe to datetime format with automated parsing.


In [18]:
pd.to_datetime(df['InvoiceDate'])

0       2010-12-01 08:34:00
1       2010-12-01 08:34:00
2       2010-12-01 08:34:00
3       2010-12-01 08:34:00
4       2010-12-01 08:34:00
                ...        
91145   2011-12-09 12:20:00
91146   2011-12-09 12:20:00
91147   2011-12-09 12:20:00
91148   2011-12-09 12:21:00
91149   2011-12-09 12:21:00
Name: InvoiceDate, Length: 91150, dtype: datetime64[ns]

Overwrite the original column

In [19]:
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'])

In [20]:
df['InvoiceDate'].describe()

  df['InvoiceDate'].describe()


count                   91150
unique                   4227
top       2011-10-20 15:57:00
freq                      222
first     2010-12-01 08:34:00
last      2011-12-09 12:21:00
Name: InvoiceDate, dtype: object

#### Sorting Data
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html

Sort a DataFrame by one column in ascending order.

In [21]:
df.sort_values("InvoiceNo", ascending = True)

Unnamed: 0,CustomerID,InvoiceNo,InvoiceDate,StockCode,Quantity,UnitPrice,Revenue
0,13047,536367,2010-12-01 08:34:00,84879,32,1.69,54.08
1,13047,536367,2010-12-01 08:34:00,22745,6,2.10,12.60
2,13047,536367,2010-12-01 08:34:00,22748,6,2.10,12.60
3,13047,536367,2010-12-01 08:34:00,22749,8,3.75,30.00
4,13047,536367,2010-12-01 08:34:00,22310,6,1.65,9.90
...,...,...,...,...,...,...,...
91145,17581,581581,2011-12-09 12:20:00,23562,6,2.89,17.34
91146,17581,581581,2011-12-09 12:20:00,23561,6,2.89,17.34
91147,17581,581581,2011-12-09 12:20:00,23681,10,1.65,16.50
91148,17581,581582,2011-12-09 12:21:00,23552,6,2.08,12.48


Sort a DataFrame by multiple columns in ascending order.

In [22]:
# Sort by CustomerID, then by InvoiceNo
df.sort_values(["CustomerID", "InvoiceNo"], ascending = [True, False])

Unnamed: 0,CustomerID,InvoiceNo,InvoiceDate,StockCode,Quantity,UnitPrice,Revenue
90384,12347,581180,2011-12-07 15:52:00,23497,12,1.45,17.40
90385,12347,581180,2011-12-07 15:52:00,23552,6,2.08,12.48
90386,12347,581180,2011-12-07 15:52:00,21064,24,1.25,30.00
90387,12347,581180,2011-12-07 15:52:00,84625A,24,0.85,20.40
90388,12347,581180,2011-12-07 15:52:00,21731,24,1.65,39.60
...,...,...,...,...,...,...,...
32704,18287,554065,2011-05-22 10:39:00,85040A,36,1.65,59.40
32705,18287,554065,2011-05-22 10:39:00,85039B,60,1.45,87.00
32706,18287,554065,2011-05-22 10:39:00,85039B,12,1.65,19.80
32707,18287,554065,2011-05-22 10:39:00,85040A,12,1.65,19.80


#### 'Fixing' data

In [23]:
df['StockCode'].astype(int)

ValueError: ignored

Best practice for replacing values: Keep logic inside a function

Define a function to delete all alphabetic characters in the input variable and return the modified variable as a string.

In [25]:
def remove_character_from_stockcode(stock_code):
  # Convert to string
  stock_code_str = str(stock_code) 
  
  # Go through all characters in each stock code and replace letters
  for character in 'ABCDEFGHIJKLMNOPQRSTUVWXYZ':
    stock_code_str = stock_code_str.replace(character, '')
 
  return(stock_code_str)

In [26]:
# Test your function
remove_character_from_stockcode('85049A')

'85049'

Create a new column in a dataframe by applying a function on each element of another column.

https://pandas.pydata.org/docs/reference/api/pandas.Series.map.html

In [27]:
df['StockCodeNumeric'] = df['StockCode'].map(remove_character_from_stockcode)

In [28]:
df['StockCodeNumeric'] = df['StockCodeNumeric'].astype(int)

In [29]:
df.sort_values("StockCode")

Unnamed: 0,CustomerID,InvoiceNo,InvoiceDate,StockCode,Quantity,UnitPrice,Revenue,StockCodeNumeric
25457,18079,550272,2011-04-15 12:14:00,10002,62,0.85,52.70,10002
9454,12748,541248,2011-01-16 13:04:00,10002,1,0.85,0.85,10002
14256,15382,544278,2011-02-17 12:01:00,10002,12,0.85,10.20,10002
9935,12451,541518,2011-01-19 09:05:00,10002,12,0.85,10.20,10002
10430,13230,541849,2011-01-23 13:34:00,10002,2,0.85,1.70,10002
...,...,...,...,...,...,...,...,...
516,14606,536591,2010-12-01 16:57:00,90214S,1,1.25,1.25,90214
515,14606,536591,2010-12-01 16:57:00,90214V,1,1.25,1.25,90214
36539,14606,556202,2011-06-09 13:08:00,90214V,1,1.25,1.25,90214
31912,14606,553503,2011-05-17 13:20:00,90214Y,1,1.25,1.25,90214


#### Missing data
https://pandas.pydata.org/docs/user_guide/missing_data.html

Check if there are any missing values (NaN) in each column of a pandas DataFrame.

In [30]:
df.isna().any()

CustomerID          False
InvoiceNo           False
InvoiceDate         False
StockCode           False
Quantity            False
UnitPrice           False
Revenue             False
StockCodeNumeric    False
dtype: bool

In [31]:
# let's create a missing value...
print(df.iloc[0,4])
df.iloc[0,4] = None

32


In [32]:
df.isna().any()

CustomerID          False
InvoiceNo           False
InvoiceDate         False
StockCode           False
Quantity             True
UnitPrice           False
Revenue             False
StockCodeNumeric    False
dtype: bool

Fill missing value

In [33]:
df['Quantity']

0         NaN
1         6.0
2         6.0
3         8.0
4         6.0
         ... 
91145     6.0
91146     6.0
91147    10.0
91148     6.0
91149    12.0
Name: Quantity, Length: 91150, dtype: float64

Fill missing values in a column with a static value.

In [34]:
df['Quantity'].fillna(32)

0        32.0
1         6.0
2         6.0
3         8.0
4         6.0
         ... 
91145     6.0
91146     6.0
91147    10.0
91148     6.0
91149    12.0
Name: Quantity, Length: 91150, dtype: float64

Fill missing values in a column with a calculated value.

In [35]:
mean = round(df['Quantity'].mean(),2)
df['Quantity'].fillna(mean)

0        12.33
1         6.00
2         6.00
3         8.00
4         6.00
         ...  
91145     6.00
91146     6.00
91147    10.00
91148     6.00
91149    12.00
Name: Quantity, Length: 91150, dtype: float64

Remove all rows with NA

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html

In [36]:
df.dropna()

Unnamed: 0,CustomerID,InvoiceNo,InvoiceDate,StockCode,Quantity,UnitPrice,Revenue,StockCodeNumeric
1,13047,536367,2010-12-01 08:34:00,22745,6.0,2.10,12.60,22745
2,13047,536367,2010-12-01 08:34:00,22748,6.0,2.10,12.60,22748
3,13047,536367,2010-12-01 08:34:00,22749,8.0,3.75,30.00,22749
4,13047,536367,2010-12-01 08:34:00,22310,6.0,1.65,9.90,22310
5,13047,536367,2010-12-01 08:34:00,84969,6.0,4.25,25.50,84969
...,...,...,...,...,...,...,...,...
91145,17581,581581,2011-12-09 12:20:00,23562,6.0,2.89,17.34,23562
91146,17581,581581,2011-12-09 12:20:00,23561,6.0,2.89,17.34,23561
91147,17581,581581,2011-12-09 12:20:00,23681,10.0,1.65,16.50,23681
91148,17581,581582,2011-12-09 12:21:00,23552,6.0,2.08,12.48,23552


Remove all columns with NA

In [37]:
df.dropna(axis = 1)

Unnamed: 0,CustomerID,InvoiceNo,InvoiceDate,StockCode,UnitPrice,Revenue,StockCodeNumeric
0,13047,536367,2010-12-01 08:34:00,84879,1.69,54.08,84879
1,13047,536367,2010-12-01 08:34:00,22745,2.10,12.60,22745
2,13047,536367,2010-12-01 08:34:00,22748,2.10,12.60,22748
3,13047,536367,2010-12-01 08:34:00,22749,3.75,30.00,22749
4,13047,536367,2010-12-01 08:34:00,22310,1.65,9.90,22310
...,...,...,...,...,...,...,...
91145,17581,581581,2011-12-09 12:20:00,23562,2.89,17.34,23562
91146,17581,581581,2011-12-09 12:20:00,23561,2.89,17.34,23561
91147,17581,581581,2011-12-09 12:20:00,23681,1.65,16.50,23681
91148,17581,581582,2011-12-09 12:21:00,23552,2.08,12.48,23552


Remove NA with threshold.

Example: Drops any column from the dataframe 'df' that has less than 10 missing values.

In [38]:
df.dropna(axis = 1, thresh = 10)

Unnamed: 0,CustomerID,InvoiceNo,InvoiceDate,StockCode,Quantity,UnitPrice,Revenue,StockCodeNumeric
0,13047,536367,2010-12-01 08:34:00,84879,,1.69,54.08,84879
1,13047,536367,2010-12-01 08:34:00,22745,6.0,2.10,12.60,22745
2,13047,536367,2010-12-01 08:34:00,22748,6.0,2.10,12.60,22748
3,13047,536367,2010-12-01 08:34:00,22749,8.0,3.75,30.00,22749
4,13047,536367,2010-12-01 08:34:00,22310,6.0,1.65,9.90,22310
...,...,...,...,...,...,...,...,...
91145,17581,581581,2011-12-09 12:20:00,23562,6.0,2.89,17.34,23562
91146,17581,581581,2011-12-09 12:20:00,23561,6.0,2.89,17.34,23561
91147,17581,581581,2011-12-09 12:20:00,23681,10.0,1.65,16.50,23681
91148,17581,581582,2011-12-09 12:21:00,23552,6.0,2.08,12.48,23552


### Integration

Let's get another dataset!

In [39]:
stock_codes_df = pd.read_csv("https://ucarecdn.com/5cef20a8-c7d8-46e1-af8a-830388dc89c9/stock_codes.csv")
stock_codes_df

Unnamed: 0,StockCode,Description
0,10002,INFLATABLE POLITICAL GLOBE
1,10080,GROOVY CACTUS INFLATABLE
2,10120,DOGGY RUBBER
3,10125,MINI FUNKY DESIGN TAPES
4,10133,COLOURING PENCILS BROWN TUBE
...,...,...
3304,90214Y,"LETTER ""Y"" BLING KEY RING"
3305,BANK CHARGES,Bank Charges
3306,C2,CARRIAGE
3307,M,Manual


#### Join (Merge)
https://pandas.pydata.org/docs/reference/api/pandas.merge.html

Merge two pandas dataframes using a left join on a single column.

In [40]:
pd.merge(left = df, right = stock_codes_df, how = "left", on = "StockCode")

Unnamed: 0,CustomerID,InvoiceNo,InvoiceDate,StockCode,Quantity,UnitPrice,Revenue,StockCodeNumeric,Description
0,13047,536367,2010-12-01 08:34:00,84879,,1.69,54.08,84879,ASSORTED COLOUR BIRD ORNAMENT
1,13047,536367,2010-12-01 08:34:00,22745,6.0,2.10,12.60,22745,POPPY'S PLAYHOUSE BEDROOM
2,13047,536367,2010-12-01 08:34:00,22748,6.0,2.10,12.60,22748,POPPY'S PLAYHOUSE KITCHEN
3,13047,536367,2010-12-01 08:34:00,22749,8.0,3.75,30.00,22749,FELTCRAFT PRINCESS CHARLOTTE DOLL
4,13047,536367,2010-12-01 08:34:00,22310,6.0,1.65,9.90,22310,IVORY KNITTED MUG COSY
...,...,...,...,...,...,...,...,...,...
91145,17581,581581,2011-12-09 12:20:00,23562,6.0,2.89,17.34,23562,SET OF 6 RIBBONS PERFECTLY PRETTY
91146,17581,581581,2011-12-09 12:20:00,23561,6.0,2.89,17.34,23561,SET OF 6 RIBBONS PARTY
91147,17581,581581,2011-12-09 12:20:00,23681,10.0,1.65,16.50,23681,LUNCH BAG RED VINTAGE DOILY
91148,17581,581582,2011-12-09 12:21:00,23552,6.0,2.08,12.48,23552,BICYCLE PUNCTURE REPAIR KIT


In [41]:
df = pd.merge(left = df, right = stock_codes_df, how = "left")

#### Concatenation

https://pandas.pydata.org/docs/reference/api/pandas.concat.html

Create some dummy data:

In [42]:
# Create data frame with some data
df1 = pd.DataFrame({
    'Name': ['John', 'Jane'],
    'Age': [32, 28],
    'Country': ['USA', 'Canada'],
    'Salary': [50000, 52000]
})

df2 = pd.DataFrame({
    'Name': ['Jim', 'Joan'],
    'Age': [41, 35],
    'Country': ['UK', 'Australia'],
    'Salary': [55000, 60000]
})

df3 = pd.DataFrame({
    'Gender': ['Male', 'Female'],
    'Group': ['A', 'B']
})

Concatenate the rows of two dataframes and reset the index to start from 0.

In [43]:
pd.concat([df1, df2]).reset_index(drop = True)

Unnamed: 0,Name,Age,Country,Salary
0,John,32,USA,50000
1,Jane,28,Canada,52000
2,Jim,41,UK,55000
3,Joan,35,Australia,60000


Concatenate the columns of two dataframes.

In [44]:
pd.concat([df1, df3], axis = 1) #axis 1... columns, axis 0... rows

Unnamed: 0,Name,Age,Country,Salary,Gender,Group
0,John,32,USA,50000,Male,A
1,Jane,28,Canada,52000,Female,B


### Transformation

#### Modeling
https://pandas.pydata.org/docs/user_guide/reshaping.html

Long to wide

In [45]:
# Tidy Data
df1 = pd.DataFrame({
    'Name': ['John', 'Jane', 'Jim', 'Joan'],
    'Age': [32, 28, 41, 35],
    'Country': ['USA', 'Canada', 'UK', 'Australia'],
    'Salary': [50000, 52000, 55000, 60000]
})

Transform a DataFrame from wide to long format by unpivoting the columns into rows while keeping the ID column fixed.

In [46]:
df_long = pd.melt(df1, id_vars=["Name"])
df_long

Unnamed: 0,Name,variable,value
0,John,Age,32
1,Jane,Age,28
2,Jim,Age,41
3,Joan,Age,35
4,John,Country,USA
5,Jane,Country,Canada
6,Jim,Country,UK
7,Joan,Country,Australia
8,John,Salary,50000
9,Jane,Salary,52000


Pivot a long-formatted DataFrame into a tidy (wide) format while specifying index, columns and cell values.

In [47]:
df_tidy = pd.pivot(df_long, index="Name", columns="variable", values="value")
df_tidy

variable,Age,Country,Salary
Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Jane,28,Canada,52000
Jim,41,UK,55000
Joan,35,Australia,60000
John,32,USA,50000


In [48]:
# Reset Index
df_tidy = df_tidy.reset_index()
df_tidy

variable,Name,Age,Country,Salary
0,Jane,28,Canada,52000
1,Jim,41,UK,55000
2,Joan,35,Australia,60000
3,John,32,USA,50000


#### Binning
https://pandas.pydata.org/docs/reference/api/pandas.cut.html


Create equal-width bins for a numeric column in a DataFrame. 

Equal-width means the range of values is divided into a fixed number of intervals (bins) of the same width or size so that each bin has an equal range of values.

In [49]:
pd.cut(df['UnitPrice'], bins = 3)

0        (-0.255, 98.36]
1        (-0.255, 98.36]
2        (-0.255, 98.36]
3        (-0.255, 98.36]
4        (-0.255, 98.36]
              ...       
91145    (-0.255, 98.36]
91146    (-0.255, 98.36]
91147    (-0.255, 98.36]
91148    (-0.255, 98.36]
91149    (-0.255, 98.36]
Name: UnitPrice, Length: 91150, dtype: category
Categories (3, interval[float64, right]): [(-0.255, 98.36] < (98.36, 196.68] < (196.68, 295.0]]

Convert numerical values into categories based on custom bin edges.

In [50]:
# Custom cutpoints with labels:
# (0,5]... small
# (5, 10]... medium
# (10, inf]... large
pd.cut(df['UnitPrice'], 
       bins = [0, 5.00, 10.00, float("inf")], 
       labels = ["small", "medium", "large"])

0        small
1        small
2        small
3        small
4        small
         ...  
91145    small
91146    small
91147    small
91148    small
91149    small
Name: UnitPrice, Length: 91150, dtype: category
Categories (3, object): ['small' < 'medium' < 'large']

In [51]:
df['UnitPriceCategory'] = pd.cut(df['UnitPrice'], 
                                 bins = [0, 5.00, 10.00, float("inf")], 
                                 labels = ["small", "medium", "large"])

In [52]:
df.sort_values("UnitPrice")

Unnamed: 0,CustomerID,InvoiceNo,InvoiceDate,StockCode,Quantity,UnitPrice,Revenue,StockCodeNumeric,Description,UnitPriceCategory
45424,14298,560828,2011-07-21 11:55:00,16045,100.0,0.04,4.0,16045,POPART WOODEN PENCILS ASST,small
51997,14414,564043,2011-08-22 12:55:00,16045,100.0,0.04,4.0,16045,POPART WOODEN PENCILS ASST,small
54904,12627,565442,2011-09-04 14:09:00,16045,100.0,0.04,4.0,16045,POPART WOODEN PENCILS ASST,small
63853,18033,569714,2011-10-05 17:28:00,16045,100.0,0.04,4.0,16045,POPART WOODEN PENCILS ASST,small
80298,12748,577057,2011-11-17 14:26:00,16045,100.0,0.04,4.0,16045,POPART WOODEN PENCILS ASST,small
...,...,...,...,...,...,...,...,...,...,...
25299,17142,550185,2011-04-14 18:22:00,22826,1.0,195.00,195.0,22826,LOVE SEAT ANTIQUE WHITE METAL,large
27616,14973,551393,2011-04-28 12:22:00,22656,1.0,295.00,295.0,22656,VINTAGE BLUE KITCHEN CABINET,large
12813,14842,543253,2011-02-04 15:32:00,22655,1.0,295.00,295.0,22655,VINTAGE RED KITCHEN CABINET,large
5605,16607,539080,2010-12-16 08:41:00,22655,1.0,295.00,295.0,22655,VINTAGE RED KITCHEN CABINET,large


#### Categorical To Numeric

https://pandas.pydata.org/docs/reference/api/pandas.factorize.html

Create numerical category codes for the values in a categorical column and store the codes and corresponding category names in two separate arrays.

In [53]:
categories_codes, categories_names = pd.factorize(df['Description'])

In [54]:
categories_codes

array([   0,    1,    2, ..., 3244, 3272, 3273])

In [55]:
categories_names

Index(['ASSORTED COLOUR BIRD ORNAMENT', 'POPPY'S PLAYHOUSE BEDROOM ',
       'POPPY'S PLAYHOUSE KITCHEN', 'FELTCRAFT PRINCESS CHARLOTTE DOLL',
       'IVORY KNITTED MUG COSY ', 'BOX OF 6 ASSORTED COLOUR TEASPOONS',
       'BOX OF VINTAGE JIGSAW BLOCKS ', 'BOX OF VINTAGE ALPHABET BLOCKS',
       'HOME BUILDING BLOCK WORD', 'LOVE BUILDING BLOCK WORD',
       ...
       'PURPLE AMETHYST NECKLACE W TASSEL', 'MOROCCAN BEATEN METAL DISH',
       'EAU DE NILE JEWELLED PHOTOFRAME', 'BLACK SQUARE TABLE CLOCK',
       'CLASSICAL ROSE TABLE LAMP', 'CREAM DELPHINIUM ARTIFICIAL FLOWER',
       'BLUE CLIMBING HYDRANGA ART FLOWER',
       'CREAM CLIMBING HYDRANGA ART FLOWER', 'PINK SQUARE TABLE CLOCK',
       'BLING KEY RING STAND'],
      dtype='object', length=3294)

Define a function that takes a categorical value as input, retrieves its corresponding numeric code from predefined lists, and returns the numeric code as output.

In [56]:
def categorical_to_numeric(category, cat_codes = categories_codes, cat_names = categories_names):
  index_loc = categories_names.get_loc(category)
  encoded_category = categories_codes[index_loc]
  return(encoded_category)

Apply the function and save the output as a new column

In [57]:
df['DescriptionEncoded'] = df['Description'].map(categorical_to_numeric)
df

Unnamed: 0,CustomerID,InvoiceNo,InvoiceDate,StockCode,Quantity,UnitPrice,Revenue,StockCodeNumeric,Description,UnitPriceCategory,DescriptionEncoded
0,13047,536367,2010-12-01 08:34:00,84879,,1.69,54.08,84879,ASSORTED COLOUR BIRD ORNAMENT,small,0
1,13047,536367,2010-12-01 08:34:00,22745,6.0,2.10,12.60,22745,POPPY'S PLAYHOUSE BEDROOM,small,1
2,13047,536367,2010-12-01 08:34:00,22748,6.0,2.10,12.60,22748,POPPY'S PLAYHOUSE KITCHEN,small,2
3,13047,536367,2010-12-01 08:34:00,22749,8.0,3.75,30.00,22749,FELTCRAFT PRINCESS CHARLOTTE DOLL,small,3
4,13047,536367,2010-12-01 08:34:00,22310,6.0,1.65,9.90,22310,IVORY KNITTED MUG COSY,small,4
...,...,...,...,...,...,...,...,...,...,...,...
91145,17581,581581,2011-12-09 12:20:00,23562,6.0,2.89,17.34,23562,SET OF 6 RIBBONS PERFECTLY PRETTY,small,282
91146,17581,581581,2011-12-09 12:20:00,23561,6.0,2.89,17.34,23561,SET OF 6 RIBBONS PARTY,small,964
91147,17581,581581,2011-12-09 12:20:00,23681,10.0,1.65,16.50,23681,LUNCH BAG RED VINTAGE DOILY,small,1227
91148,17581,581582,2011-12-09 12:21:00,23552,6.0,2.08,12.48,23552,BICYCLE PUNCTURE REPAIR KIT,small,337


One-Hot Encoding

https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html

Generate a dummy variable for each unique value in a column

In [58]:
pd.get_dummies(df['Description'])

Unnamed: 0,4 PURPLE FLOCK DINNER CANDLES,50'S CHRISTMAS GIFT BAG LARGE,DOLLY GIRL BEAKER,I LOVE LONDON MINI BACKPACK,NINE DRAWER OFFICE TIDY,OVAL WALL MIRROR DIAMANTE,RED SPOT GIFT BAG LARGE OR GIFT BAG LARGE SPOT,SET 2 TEA TOWELS I LOVE LONDON OR SET 2 TEA TOWELS I LOVE LONDON,SPACEBOY BABY GIFT SET,TRELLIS COAT RACK,...,ZINC HERB GARDEN CONTAINER OR METAL HERB GERDEN CONTAINER,ZINC METAL HEART DECORATION,ZINC SWEETHEART SOAP DISH,ZINC SWEETHEART WIRE LETTER RACK,ZINC T-LIGHT HOLDER STAR LARGE OR ZINC T-LIGHT HOLDER STARS LARGE,ZINC T-LIGHT HOLDER STARS SMALL,ZINC TOP 2 DOOR WOODEN SHELF,ZINC WILLIE WINKIE CANDLE STICK,ZINC WIRE KITCHEN ORGANISER,ZINC WIRE SWEETHEART LETTER TRAY
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
91145,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
91146,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
91147,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
91148,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### Enrichment

Create a new column by applying an arithmetic operation on other columns

In [59]:
df['Revenue'] = df['Quantity'] * df['UnitPrice']

In [60]:
df.head(2)

Unnamed: 0,CustomerID,InvoiceNo,InvoiceDate,StockCode,Quantity,UnitPrice,Revenue,StockCodeNumeric,Description,UnitPriceCategory,DescriptionEncoded
0,13047,536367,2010-12-01 08:34:00,84879,,1.69,,84879,ASSORTED COLOUR BIRD ORNAMENT,small,0
1,13047,536367,2010-12-01 08:34:00,22745,6.0,2.1,12.6,22745,POPPY'S PLAYHOUSE BEDROOM,small,1


In [61]:
df['Revenue_Double'] = df['Revenue'] * 2
df

Unnamed: 0,CustomerID,InvoiceNo,InvoiceDate,StockCode,Quantity,UnitPrice,Revenue,StockCodeNumeric,Description,UnitPriceCategory,DescriptionEncoded,Revenue_Double
0,13047,536367,2010-12-01 08:34:00,84879,,1.69,,84879,ASSORTED COLOUR BIRD ORNAMENT,small,0,
1,13047,536367,2010-12-01 08:34:00,22745,6.0,2.10,12.60,22745,POPPY'S PLAYHOUSE BEDROOM,small,1,25.20
2,13047,536367,2010-12-01 08:34:00,22748,6.0,2.10,12.60,22748,POPPY'S PLAYHOUSE KITCHEN,small,2,25.20
3,13047,536367,2010-12-01 08:34:00,22749,8.0,3.75,30.00,22749,FELTCRAFT PRINCESS CHARLOTTE DOLL,small,3,60.00
4,13047,536367,2010-12-01 08:34:00,22310,6.0,1.65,9.90,22310,IVORY KNITTED MUG COSY,small,4,19.80
...,...,...,...,...,...,...,...,...,...,...,...,...
91145,17581,581581,2011-12-09 12:20:00,23562,6.0,2.89,17.34,23562,SET OF 6 RIBBONS PERFECTLY PRETTY,small,282,34.68
91146,17581,581581,2011-12-09 12:20:00,23561,6.0,2.89,17.34,23561,SET OF 6 RIBBONS PARTY,small,964,34.68
91147,17581,581581,2011-12-09 12:20:00,23681,10.0,1.65,16.50,23681,LUNCH BAG RED VINTAGE DOILY,small,1227,33.00
91148,17581,581582,2011-12-09 12:21:00,23552,6.0,2.08,12.48,23552,BICYCLE PUNCTURE REPAIR KIT,small,337,24.96


### Reduction
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html

Delete a column from a dataframe based on its name

In [62]:
df = df.drop("DescriptionEncoded", axis = 1) #axis 1... columns, axis 0... rows
df

Unnamed: 0,CustomerID,InvoiceNo,InvoiceDate,StockCode,Quantity,UnitPrice,Revenue,StockCodeNumeric,Description,UnitPriceCategory,Revenue_Double
0,13047,536367,2010-12-01 08:34:00,84879,,1.69,,84879,ASSORTED COLOUR BIRD ORNAMENT,small,
1,13047,536367,2010-12-01 08:34:00,22745,6.0,2.10,12.60,22745,POPPY'S PLAYHOUSE BEDROOM,small,25.20
2,13047,536367,2010-12-01 08:34:00,22748,6.0,2.10,12.60,22748,POPPY'S PLAYHOUSE KITCHEN,small,25.20
3,13047,536367,2010-12-01 08:34:00,22749,8.0,3.75,30.00,22749,FELTCRAFT PRINCESS CHARLOTTE DOLL,small,60.00
4,13047,536367,2010-12-01 08:34:00,22310,6.0,1.65,9.90,22310,IVORY KNITTED MUG COSY,small,19.80
...,...,...,...,...,...,...,...,...,...,...,...
91145,17581,581581,2011-12-09 12:20:00,23562,6.0,2.89,17.34,23562,SET OF 6 RIBBONS PERFECTLY PRETTY,small,34.68
91146,17581,581581,2011-12-09 12:20:00,23561,6.0,2.89,17.34,23561,SET OF 6 RIBBONS PARTY,small,34.68
91147,17581,581581,2011-12-09 12:20:00,23681,10.0,1.65,16.50,23681,LUNCH BAG RED VINTAGE DOILY,small,33.00
91148,17581,581582,2011-12-09 12:21:00,23552,6.0,2.08,12.48,23552,BICYCLE PUNCTURE REPAIR KIT,small,24.96


#### Filtering

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.query.html

Filter a dataframe given some criteria

In [63]:
df.query("Quantity > 1500")

Unnamed: 0,CustomerID,InvoiceNo,InvoiceDate,StockCode,Quantity,UnitPrice,Revenue,StockCodeNumeric,Description,UnitPriceCategory,Revenue_Double
972,16754,536830,2010-12-02 17:38:00,84077,2880.0,0.18,518.4,84077,WORLD WAR 2 GLIDERS ASSTD DESIGNS,small,1036.8
58583,17450,567423,2011-09-20 11:05:00,23286,1878.0,1.08,2028.24,23286,BLUE VINTAGE SPOT BEAKER,small,4056.48
58584,17450,567423,2011-09-20 11:05:00,23288,1944.0,1.08,2099.52,23288,GREEN VINTAGE SPOT BEAKER,small,4199.04
58585,17450,567423,2011-09-20 11:05:00,23285,1944.0,1.08,2099.52,23285,PINK VINTAGE SPOT BEAKER,small,4199.04


Advanced filtering with logical operators:

`|` ... OR

`&` .... AND


In [64]:
df.query("Quantity > 1500 | Revenue > 7000" )

Unnamed: 0,CustomerID,InvoiceNo,InvoiceDate,StockCode,Quantity,UnitPrice,Revenue,StockCodeNumeric,Description,UnitPriceCategory,Revenue_Double
972,16754,536830,2010-12-02 17:38:00,84077,2880.0,0.18,518.4,84077,WORLD WAR 2 GLIDERS ASSTD DESIGNS,small,1036.8
58583,17450,567423,2011-09-20 11:05:00,23286,1878.0,1.08,2028.24,23286,BLUE VINTAGE SPOT BEAKER,small,4056.48
58584,17450,567423,2011-09-20 11:05:00,23288,1944.0,1.08,2099.52,23288,GREEN VINTAGE SPOT BEAKER,small,4199.04
58585,17450,567423,2011-09-20 11:05:00,23285,1944.0,1.08,2099.52,23285,PINK VINTAGE SPOT BEAKER,small,4199.04
58592,17450,567423,2011-09-20 11:05:00,23243,1412.0,5.06,7144.72,23243,SET OF TEA COFFEE SUGAR TINS PANTRY,medium,14289.44


In [65]:
df.query("Quantity > 1500 & Revenue > 1000" )

Unnamed: 0,CustomerID,InvoiceNo,InvoiceDate,StockCode,Quantity,UnitPrice,Revenue,StockCodeNumeric,Description,UnitPriceCategory,Revenue_Double
58583,17450,567423,2011-09-20 11:05:00,23286,1878.0,1.08,2028.24,23286,BLUE VINTAGE SPOT BEAKER,small,4056.48
58584,17450,567423,2011-09-20 11:05:00,23288,1944.0,1.08,2099.52,23288,GREEN VINTAGE SPOT BEAKER,small,4199.04
58585,17450,567423,2011-09-20 11:05:00,23285,1944.0,1.08,2099.52,23285,PINK VINTAGE SPOT BEAKER,small,4199.04


Advanced filtering with indexing
https://pandas.pydata.org/docs/user_guide/indexing.html

In [66]:
customerIDs = ['17450', '16754']
filter = df['CustomerID'].isin(customerIDs)
df[filter]

Unnamed: 0,CustomerID,InvoiceNo,InvoiceDate,StockCode,Quantity,UnitPrice,Revenue,StockCodeNumeric,Description,UnitPriceCategory,Revenue_Double
972,16754,536830,2010-12-02 17:38:00,84077,2880.0,0.18,518.40,84077,WORLD WAR 2 GLIDERS ASSTD DESIGNS,small,1036.80
973,16754,536830,2010-12-02 17:38:00,21915,1400.0,1.06,1484.00,21915,RED HARMONICA IN BOX,small,2968.00
2824,17450,537448,2010-12-07 09:23:00,21756,6.0,6.60,39.60,21756,BATH BUILDING BLOCK WORD,medium,79.20
2825,17450,537456,2010-12-07 09:43:00,22469,402.0,1.93,775.86,22469,HEART OF WICKER SMALL,small,1551.72
2826,17450,537456,2010-12-07 09:43:00,22470,378.0,3.21,1213.38,22470,HEART OF WICKER LARGE,small,2426.76
...,...,...,...,...,...,...,...,...,...,...,...
87589,17450,580063,2011-12-01 13:29:00,82583,96.0,2.39,229.44,82583,HOT BATHS METAL SIGN,small,458.88
87590,17450,580063,2011-12-01 13:29:00,82600,96.0,2.39,229.44,82600,NO SINGING METAL SIGN OR N0 SINGING METAL SIGN,small,458.88
87591,17450,580063,2011-12-01 13:29:00,21174,144.0,2.39,344.16,21174,POTTERING IN THE SHED METAL SIGN,small,688.32
87592,17450,580063,2011-12-01 13:29:00,21166,288.0,2.39,688.32,21166,COOK WITH WINE METAL SIGN,small,1376.64


#### Sampling
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sample.html


Sample a fixed number of random observations with random state

In [67]:
df.sample(n = 3)

Unnamed: 0,CustomerID,InvoiceNo,InvoiceDate,StockCode,Quantity,UnitPrice,Revenue,StockCodeNumeric,Description,UnitPriceCategory,Revenue_Double
14133,13078,544169,2011-02-16 13:44:00,21871,12.0,1.25,15.0,21871,SAVE THE PLANET MUG,small,30.0
69226,17672,572082,2011-10-20 14:18:00,22999,24.0,0.42,10.08,22999,TRAVEL CARD WALLET RETRO PETALS OR TRAVEL CARD...,small,20.16
61338,12748,568703,2011-09-28 15:21:00,21832,2.0,1.65,3.3,21832,CHOCOLATE CALCULATOR,small,6.6


Sample a fixed number of random observations with random state

In [68]:
df.sample(n = 3, random_state = 123)

Unnamed: 0,CustomerID,InvoiceNo,InvoiceDate,StockCode,Quantity,UnitPrice,Revenue,StockCodeNumeric,Description,UnitPriceCategory,Revenue_Double
75168,17920,574721,2011-11-06 14:43:00,85032B,1.0,0.65,0.65,85032,BLOSSOM IMAGES GIFT WRAP SET,small,1.3
83778,12748,578461,2011-11-24 12:30:00,23045,2.0,4.15,8.3,23045,PAPER LANTERN 5 POINT STAR MOON 30 OR PAPER LA...,small,16.6
39954,13656,558293,2011-06-28 10:31:00,23205,10.0,0.85,8.5,23205,CHARLOTTE BAG VINTAGE ALPHABET OR CHARLOTTE B...,small,17.0


Sample a percentage of random observations

In [69]:
df.sample(frac = 0.0001)

Unnamed: 0,CustomerID,InvoiceNo,InvoiceDate,StockCode,Quantity,UnitPrice,Revenue,StockCodeNumeric,Description,UnitPriceCategory,Revenue_Double
80068,17397,577049,2011-11-17 13:58:00,22865,7.0,2.1,14.7,22865,HAND WARMER OWL DESIGN,small,29.4
63081,14057,569483,2011-10-04 12:45:00,22468,4.0,6.75,27.0,22468,BABUSHKA LIGHTS STRING OF 10,medium,54.0
23706,17735,549281,2011-04-07 13:19:00,21913,4.0,3.75,15.0,21913,VINTAGE SEASIDE JIGSAW PUZZLES,small,30.0
67952,14298,571653,2011-10-18 12:17:00,22065,48.0,0.39,18.72,22065,CHRISTMAS PUDDING TRINKET POT,small,37.44
24292,14696,549693,2011-04-11 13:43:00,21078,12.0,0.85,10.2,21078,SET/20 STRAWBERRY PAPER NAPKINS,small,20.4
90913,17144,581451,2011-12-08 17:57:00,48187,2.0,8.25,16.5,48187,DOORMAT NEW ENGLAND,medium,33.0
736,17976,536749,2010-12-02 13:49:00,20975,1.0,0.65,0.65,20975,12 PENCILS SMALL TUBE RED RETROSPOT,small,1.3
4498,17126,538379,2010-12-12 11:26:00,22920,1.0,0.65,0.65,22920,HERB MARKER BASIL,small,1.3
6156,14146,539444,2010-12-17 15:52:00,21481,36.0,2.55,91.8,21481,FAWN BLUE HOT WATER BOTTLE,small,183.6


## Exporting Data
https://pandas.pydata.org/docs/reference/io.html

Save a dataframe to CSV without the row index

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html

In [70]:
df.to_csv('data.csv', index=False)

Save a dataframe to a new Excel file without the row index

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_excel.html

In [71]:
df.to_excel('data.xlsx', index=False)

Save a dataframe to an existing Excel file as a new worksheet without modifying other worksheets

In [72]:
with pd.ExcelWriter('data.xlsx', engine='openpyxl', mode='a') as writer:  
    df.to_excel(writer, sheet_name='New_Sheet')

Save a dataframe to a database via SQL

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_sql.html

In [73]:
import sqlite3
# https://www.sqlite.org/about.html

# Connect to a SQL database
db_connection = sqlite3.connect('data.db')

# Write the dataframe to a SQL table
df.to_sql('my_table', con=db_connection, if_exists='append')


91150

## Bonus: Create a Data Preprocessing Script With ChatGPT

**Steps:**

1. Open chat.openai.com
2. Paste the following prompt
3. Edit according to your needs
4. Run and check the generated code

**Prompt:**

```
Task: Write Python code using Pandas 
Method: Step by Step
Output_Format: Code only
File_Name: "mydata.csv"
Steps:
- Open a file "retail_data_clean.csv"
- Store it in a Pandas DataFrame called "df".
- Check for missing values
- List the data type of each column
- Convert the column `CustomerID`  to a string data type
- Convert the column `InvoiceDate` to datetime format with automated parsing (overwrite)
- Sort the dataframe by Customer ID (asc) and then InvoiceNo (desc)
- Create a new column "Revenue" by multiplying the columns "UnitPrice" and "Quantity"
- Filter the dataframe using .query for Quantity > 1500
- Save the dataframe to Excel "output.xlsx"
```

In [75]:
import pandas as pd

# Open the file and store it in a dataframe
df = pd.read_csv("retail_data_clean.csv")

# Check for missing values
print("Missing values:\n", df.isnull().sum())

# List the data type of each column
print("Data types:\n", df.dtypes)

# Convert the column CustomerID to a string data type
df["CustomerID"] = df["CustomerID"].astype(str)

# Convert the column InvoiceDate to datetime format with automated parsing (overwrite)
df["InvoiceDate"] = pd.to_datetime(df["InvoiceDate"], infer_datetime_format=True)

# Sort the dataframe by Customer ID (asc) and then InvoiceNo (desc)
df = df.sort_values(by=["CustomerID", "InvoiceNo"], ascending=[True, False])

# Create a new column "Revenue" by multiplying the columns "UnitPrice" and "Quantity"
df["Revenue"] = df["UnitPrice"] * df["Quantity"]

# Filter the dataframe using .query for Quantity > 1500
df = df.query("Quantity > 1500")

# Save the dataframe to Excel "output.xlsx"
df.to_excel("output.xlsx", index=False)


Missing values:
 CustomerID     0
InvoiceNo      0
InvoiceDate    0
StockCode      0
Quantity       0
UnitPrice      0
Revenue        0
dtype: int64
Data types:
 CustomerID       int64
InvoiceNo        int64
InvoiceDate     object
StockCode       object
Quantity         int64
UnitPrice      float64
Revenue        float64
dtype: object


In [76]:
df

Unnamed: 0,CustomerID,InvoiceNo,InvoiceDate,StockCode,Quantity,UnitPrice,Revenue
972,16754,536830,2010-12-02 17:38:00,84077,2880,0.18,518.4
58583,17450,567423,2011-09-20 11:05:00,23286,1878,1.08,2028.24
58584,17450,567423,2011-09-20 11:05:00,23288,1944,1.08,2099.52
58585,17450,567423,2011-09-20 11:05:00,23285,1944,1.08,2099.52
