## Transforming text data
By the end of this lecture you will be able to:
- modify text data in Polars
- split text data
- merge text columns to create a new column

In [1]:
import polars as pl

For this lecture we create a `DataFrame` of fake news articles

In [2]:
df = pl.DataFrame(
    {
        "publication": [
            "The Daily Deception",
            "Faux News Network",
            "The Fabricator",
            "The Misleader",
            "The Hoax Herald",
        ],
        "date": [
            "2022-01-01",
            "2022-01-03",
            "2022-01-04",
            "2022-01-05",
            "2022-01-06",
        ],
        "title": [
            "Scientists Discover New Species of Flying Elephant",
            "Aliens Land on Earth and Offer to Solve All Our Problems",
            "Study Shows That Eating Pizza Every Day Leads to Longer Life",
            "New Study Finds That Smoking is Good for You",
            "World's Largest Iceberg Discovered in Florida",
        ],
        "text": [
            "In a groundbreaking discovery, scientists have found a new species of elephant that can fly. The flying elephants, which were found in the Amazon rainforest, have wings that span over 50 feet and can reach speeds of up to 100 miles per hour. This is a game-changing discovery that could revolutionize the field of zoology.",
            "In a historic moment for humanity, aliens have landed on Earth and offered to solve all our problems. The extraterrestrial visitors, who arrived in a giant spaceship that landed in Central Park, have advanced technology that can cure disease, end hunger, and reverse climate change. The world is waiting to see how this incredible offer will play out.",
            "A new study has found that eating pizza every day can lead to a longer life. The study, which was conducted by a team of Italian researchers, looked at the eating habits of over 10,000 people and found that those who ate pizza regularly lived on average two years longer than those who didn't. The study has been hailed as a breakthrough in the field of nutrition.",
            "In a surprising twist, a new study has found that smoking is actually good for you. The study, which was conducted by a team of British researchers, looked at the health outcomes of over 100,000 people and found that those who smoked regularly had lower rates of heart disease and cancer than those who didn't. The findings have sparked controversy among health experts.",
            "In a bizarre turn of events, the world's largest iceberg has been discovered in Florida. The iceberg, which is over 100 miles long and 50 miles wide, was found off the coast of Miami by a group of tourists on a whale-watching tour. Scientists are baffled by the discovery and are scrambling to figure out how an iceberg of this size could have",
        ],
    }
)

We set how many string characters should be printed per column with `set_fmt_str_lengths`

In [3]:
pl.Config.set_fmt_str_lengths(100)

polars.config.Config

## The `.str` namespace
Polars has a `.str` namespace to group string expressions together.

We can see the full set of string methods in the API pages:
https://pola-rs.github.io/polars/py-polars/html/reference/expressions/string.html

## Changing case
We can change case of all letters with `str.to_lowercase` and `str.to_uppercase`

In [4]:
(
    df
    .select(
        pl.col('title').str.to_uppercase()
    )
    .head(2)
)

title
str
"""SCIENTISTS DISCOVER NEW SPECIES OF FLYING ELEPHANT"""
"""ALIENS LAND ON EARTH AND OFFER TO SOLVE ALL OUR PROBLEMS"""


## Length of strings
We can get the length of a string either as the number of characters with `len_chars` or as the number of bytes with `len_bytes`

In [5]:
(
    df
    .select(
        len_chars = pl.col('title').str.len_chars(),
        len_bytes = pl.col('title').str.len_bytes(),
    )
)

len_chars,len_bytes
u32,u32
50,50
56,56
60,60
44,44
45,45


In this example we see how these metrics differ using the names of some Bon Iver songs, one of which contains non-ASCII Unicode characters

In [6]:
(
    pl.DataFrame(
        {'title':['Holocene','22 (OVER S∞∞N)']}
                  )
    .select(
        len_chars = pl.col('title').str.len_chars(),
        len_bytes = pl.col('title').str.len_bytes(),
    )
)

len_chars,len_bytes
u32,u32
8,8
14,18


## Remove whitespace

We can remove leading whitespace with `strip_chars_start`. Here we define a `DataFrame` with strings that have leading whitespace, trailing whitespace, and both

In [8]:
(
    pl.DataFrame(
        {"foo": [" lead", "trail ", " both "]}
    )
    .select(
        pl.col("foo").str.strip_chars_start()
    )
)

foo
str
"""lead"""
"""trail """
"""both """


We can use `strip_chars` to remove leading and trailing whitespace or `strip_chars_end` to remove trailing whitespace

## Justify and padding
We can return a string justified to a certain length with a padding character.

In this example we left-justify to 10 characters (including whitespace) and pad with a `*`

In [10]:
(
    pl.DataFrame(
        {
            "foo": [" lead", "trail ", " both "]
        }
    )
    .select(
        "foo",
        left_justified = pl.col("foo").str.pad_end(10,"*"),
    )
)

foo,left_justified
str,str
""" lead""",""" lead*****"""
"""trail ""","""trail ****"""
""" both """,""" both ****"""


And we can apply zero-padding with `zfill`

In [11]:
(
    pl.DataFrame(
        {
            "foo": ["1", "10", "100"]
        }
    )
    .select(
        pl.col("foo").str.zfill(3)
    )
)

foo
str
"""001"""
"""010"""
"""100"""


## Splitting text

We can split text into a `pl.List` dtype column with the `str.split` method.

In this example we split the `text` column on single spaces

In [12]:
pl.Config.set_tbl_rows(30)
(
    df
    .with_columns(
        pl.col('text').str.split(' ')
    )
)

publication,date,title,text
str,str,str,list[str]
"""The Daily Deception""","""2022-01-01""","""Scientists Discover New Species of Flying Elephant""","[""In"", ""a"", … ""zoology.""]"
"""Faux News Network""","""2022-01-03""","""Aliens Land on Earth and Offer to Solve All Our Problems""","[""In"", ""a"", … ""out.""]"
"""The Fabricator""","""2022-01-04""","""Study Shows That Eating Pizza Every Day Leads to Longer Life""","[""A"", ""new"", … ""nutrition.""]"
"""The Misleader""","""2022-01-05""","""New Study Finds That Smoking is Good for You""","[""In"", ""a"", … ""experts.""]"
"""The Hoax Herald""","""2022-01-06""","""World's Largest Iceberg Discovered in Florida""","[""In"", ""a"", … ""have""]"


If we want to do further analysis on the individual words it is often easiest to then `explode` the list column so that each entry in the list gets its own row

In [13]:
pl.Config.set_tbl_rows(6)
(
    df
    .with_columns(
        pl.col('text').str.split(' ')
    )
    .explode('text')
)

publication,date,title,text
str,str,str,str
"""The Daily Deception""","""2022-01-01""","""Scientists Discover New Species of Flying Elephant""","""In"""
"""The Daily Deception""","""2022-01-01""","""Scientists Discover New Species of Flying Elephant""","""a"""
"""The Daily Deception""","""2022-01-01""","""Scientists Discover New Species of Flying Elephant""","""groundbreaking"""
…,…,…,…
"""The Hoax Herald""","""2022-01-06""","""World's Largest Iceberg Discovered in Florida""","""size"""
"""The Hoax Herald""","""2022-01-06""","""World's Largest Iceberg Discovered in Florida""","""could"""
"""The Hoax Herald""","""2022-01-06""","""World's Largest Iceberg Discovered in Florida""","""have"""


If we want to split by a regular expression instead, we use the `str.extract_all` method that we meet in the next lecture.

The output in this case keeps the data in the other columns. For a large `DataFrame` we could reduce memory usage by casting these to categorical as they have many repeated values

In [14]:
(
    df
    .with_columns(
        pl.col(["publication","title"]).cast(pl.Categorical),
        pl.col('text').str.split(' ')
    )
    .explode('text')
)

publication,date,title,text
cat,str,cat,str
"""The Daily Deception""","""2022-01-01""","""Scientists Discover New Species of Flying Elephant""","""In"""
"""The Daily Deception""","""2022-01-01""","""Scientists Discover New Species of Flying Elephant""","""a"""
"""The Daily Deception""","""2022-01-01""","""Scientists Discover New Species of Flying Elephant""","""groundbreaking"""
…,…,…,…
"""The Hoax Herald""","""2022-01-06""","""World's Largest Iceberg Discovered in Florida""","""size"""
"""The Hoax Herald""","""2022-01-06""","""World's Largest Iceberg Discovered in Florida""","""could"""
"""The Hoax Herald""","""2022-01-06""","""World's Largest Iceberg Discovered in Florida""","""have"""


With the `explode` method we can now do word-level analysis.

In this example we count how often each word occurs (we learn more about `value_counts` in the next Section) 

In [15]:
(
    df
    .with_columns(
        pl.col('text').str.split(' ')
    )
    .explode('text')
    ['text']
    .value_counts(sort=True)
)

text,count
str,u32
"""a""",14
"""of""",13
"""that""",9
…,…
"""out""",1
"""an""",1
"""size""",1


We can also explode a string column to have each character on its own row

In [16]:
(
    df
    .select(
        pl.col('publication').str.split("").explode()
    )
    .head(6)
)

publication
str
"""T"""
"""h"""
"""e"""
""" """
"""D"""
"""a"""


If we want to split a regular pattern with the same number of splits on each row we can use `str.split_exact`. See the exercises for an example.

## Merging string columns to create a new column
We can merge string columns with the `pl.concat_str` function

In [17]:
(
    df
    .with_columns(
        title_date = pl.concat_str(
            [
                pl.col('title'),
                pl.col('date').cast(pl.Utf8)
            ],
            separator="_"
        )
    )
    .head(2)
)

publication,date,title,text,title_date
str,str,str,str,str
"""The Daily Deception""","""2022-01-01""","""Scientists Discover New Species of Flying Elephant""","""In a groundbreaking discovery, scientists have found a new species of elephant that can fly. The fly…","""Scientists Discover New Species of Flying Elephant_2022-01-01"""
"""Faux News Network""","""2022-01-03""","""Aliens Land on Earth and Offer to Solve All Our Problems""","""In a historic moment for humanity, aliens have landed on Earth and offered to solve all our problems…","""Aliens Land on Earth and Offer to Solve All Our Problems_2022-01-03"""


## Exercises

### Exercise 1
You have been given the following string data with formatting errors.

Clean the data so that:
- the data in the `id` column is homogeneous with values `A` and `B`
- you can sort the `DataFrame` by zero-padded strings in the `values` column (without casting to integers)

Then add a column that counts how many characters there are in the `values` column

In [None]:
(
    pl.DataFrame(
        {
            "id": ["A","B","a","b"],
            "values": ["20","5"," 13","40"],
        })
    <blank>
    .sort('values')
)

### Exercise 2
Clean the `origin` column of this `DataFrame` so that you can count how many records come from each city

In [None]:
df_origin = pl.DataFrame(
    [
        {"origin": "New York   ", "age": 25},
        {"origin": "Los Angeles", "age": 31},
        {"origin": "  miami", "age": 47},
        {"origin": "  Chicago  ", "age": 19},
        {"origin": "   boston   ", "age": 55},
        {"origin": " New York   ", "age": 28},
        {"origin": "los Angeles", "age": 11},
        {"origin": "Miami", "age": 27},
        {"origin": "  chicago  ", "age": 31},
        {"origin": "  Boston   ", "age": 45},
        {"origin": "new york", "age": 25},
    ]
)

The output should look like this:

In [None]:
pl.DataFrame(
    [
        {"origin": "new york", "counts": 3},
        {"origin": "los angeles", "counts": 2},
        {"origin": "miami", "counts": 2},
        {"origin": "chicago", "counts": 2},
        {"origin": "boston", "counts": 2},
    ]
)


### Exercise 3

Clean and then justify the text to have 4-digit years.

Hint: you can only pad with one fill character at a time. Examine the data carefully!

In [None]:
(
    pl.DataFrame(
        {"year": ["2022", "21", "22 "]}
    )
    .select(
        <blank>
    )
)

### Exercise 4
Split the `id` column into a `pl.Struct` column called `struct_col` with 3 fields. 

In [None]:
(
    pl.DataFrame(
        [
            {"id": "AAA-BBB-2"},
            {"id": "AAA-BBB-3"},
            {"id": "AAA-CCC-2"},
            {"id": "AAA-DDD-3"},
            {"id": "AAA-BBB-4"},
        ]
    )
    .with_columns(
        struct_col = <blank>
    )
)

Convert the struct fields into columns of the `DataFrame`

### Exercise 5
We create a `DataFrame` from the Spotify data

In [None]:
pl.Config.set_fmt_str_lengths(100)
pl.Config.set_tbl_rows(10)
spotify_csv = "../data/spotify-charts-2017-2021-global-top200.csv.gz"
spotify_df = pl.read_csv(spotify_csv,try_parse_dates=True)
spotify_df.head(3)

Let's find out what makes for long track titles.

- Keep one row for every unique track (with uniqueness defined by title and artist) 
- Add columns with the length of the title column in characters (`len_chars`) and bytes (`len_bytes`)
- Find the 10 tracks with the longest titles by number of characters

In [None]:
(
    spotify_df
    <blank>
    .select("title","artist","len_chars","len_bytes")
)

When do we get the biggest difference between the title representation in characters and bytes?

- Add a column called `diff` with the difference in the number of bytes and characters in the title
- Keep only tracks where the difference is greater than 0
- Show the 10 tracks with the biggest difference

## Solutions
### Solution to exercise 1
You have been given the following string data with formatting errors.

Clean the data so that:
- the data in the `id` column is homogeneous with values `A` and `B`
- you can sort the `DataFrame` by zero-padded strings in the `values` column (without casting to integers)

Then add a column that counts how many characters there are in the `values` column

In [None]:
(
    pl.DataFrame(
        {
            "id": ["A","B","a","b"],
            "values": ["20","5"," 13","40"],
        })
    .with_columns(
        pl.col('id').str.to_uppercase(),
        pl.col('values').str.strip_chars().str.zfill(2)
    )       
    .with_columns(
        pl.col("values").str.len_chars().alias("len_chars")
    )
    .sort('values')
).to_dicts()

### Solution to exercise 2
Clean the `origin` column of this `DataFrame` so that you can count how many records come from each city

In [None]:
df_origin = pl.DataFrame(
    [
        {"origin": "New York   ", "age": 25},
        {"origin": "Los Angeles", "age": 31},
        {"origin": "  miami", "age": 47},
        {"origin": "  Chicago  ", "age": 19},
        {"origin": "   boston   ", "age": 55},
        {"origin": " New York   ", "age": 28},
        {"origin": "los Angeles", "age": 11},
        {"origin": "Miami", "age": 27},
        {"origin": "  chicago  ", "age": 31},
        {"origin": "  Boston   ", "age": 45},
        {"origin": "new york", "age": 25},
    ]
)

In [None]:
(
    df_origin
    .with_columns(
        pl.col("origin").str.strip_chars().str.to_lowercase()
    )
    ["origin"]
    .value_counts(sort=True)
)

### Solution to exercise 3

Clean and then justify the text to have 4-digit years.


In [None]:
(
    pl.DataFrame(
        {"year": ["2022", "21", "22 "]}
    )
    .select(
        pl.col("year").str.strip_chars().str.pad_start(3,"0").str.pad_start(4,"2")
    )
)

### Solution to exercise 4
Split the `id` column into a `pl.Struct` column with 3 fields. 

In [None]:
(
    pl.DataFrame(
        [
            {"id": "AAA-BBB-2"},
            {"id": "AAA-BBB-3"},
            {"id": "AAA-CCC-2"},
            {"id": "AAA-DDD-3"},
            {"id": "AAA-BBB-4"},
        ]
    )
    .with_columns(
        struct_col = pl.col("id").str.split_exact("-",2)
    )
)

Convert the struct fields into columns of the `DataFrame`

In [None]:
(
    pl.DataFrame(
        [
            {"id": "AAA-BBB-2"},
            {"id": "AAA-BBB-3"},
            {"id": "AAA-CCC-2"},
            {"id": "AAA-DDD-3"},
            {"id": "AAA-BBB-4"},
        ]
    )
    .with_columns(
        struct_col = pl.col("id").str.split_exact("-",2)
    )
    .unnest('struct_col')
)

### Solution to exercise 5 

Create a `DataFrame` from the Spotify data

In [None]:
pl.Config.set_fmt_str_lengths(100)
pl.Config.set_tbl_rows(10)
spotify_csv = "../data/spotify-charts-2017-2021-global-top200.csv.gz"
spotify_df = pl.read_csv(spotify_csv,try_parse_dates=True)
spotify_df.head(3)

Let's find out what makes for long track titles.

- Keep one row for every unique track (with uniqueness defined by title and artist) 
- Add columns with the length of the title column in characters (`len_chars`) and bytes (`len_bytes`)
- Find the 10 tracks with the longest titles by number of characters

In [None]:
(
    spotify_df
    .unique(subset=["title","artist"])
    .with_columns(
        len_chars = pl.col("title").str.len_chars(),
        len_bytes = pl.col("title").str.len_bytes(),
    )
    .sort("len_chars",descending=True)
    .head(10)
    .select("title","artist","len_chars","len_bytes")
)

When do we get the biggest difference between the title representation in characters and bytes?

- Add a column called `diff` with the difference in the number of bytes and characters in the title
- Keep only tracks where the difference is greater than 0
- Show the 10 tracks with the biggest difference

In [None]:
(
    spotify_df
    .unique(subset=["title","artist"])
    .with_columns(
        len_chars = pl.col("title").str.len_chars().cast(pl.Int32),
        len_bytes = pl.col("title").str.len_bytes().cast(pl.Int32),
    )
    .with_columns(
        diff = pl.col("len_bytes") - pl.col("len_chars")
    )
    .filter(pl.col("diff") > 0)
    .sort("diff",descending=True)
    .head(10)
    .select("title","artist","len_chars","len_bytes","diff")
)