### Task 1

Load the text data file named magic.txt. It contains the description of 1419 Magic: The Gathering cards. This description contains crucial information about each card, such as its name, its mana cost, its type and its effect. Your task will be to extract all crucial information from this data set of unstructured text and turn it into a well-structured data format.

In the file, these fields are labeled `CardName`, `CardCost`, `CardType`, and `CardEffect`.

#### **Breakdown of Each Part:**

| Regex Component | Explanation |
| --- | --- |
| `CardName:` | Matches the fixed text `"CardName: "` exactly. |
| `(.*?)` | Captures **CardName** (non-greedy match to stop at the next `CardCost:`). |
| `CardCost:` | Matches the fixed text `"CardCost: "`. |
| `(.*?)` | Captures **CardCost** (stops at the next `CardType:`). |
| `CardType:` | Matches the fixed text `"CardType: "`. |
| `(.*?)` | Captures **CardType** (stops at the next `CardEffect:`). |
| `CardEffect:` | Matches the fixed text `"CardEffect: "`. |
| `(.*)` | Captures **CardEffect** (greedy match that takes the remaining text). |

1.  **Capturing Groups `(.*?)`**:
    -   **`.*?` (non-greedy match)** captures as little as possible, so each group stops at the next fixed label (`CardCost:`, `CardType:`, `CardEffect:`).
2.  **Fixed Text Matching**:
    -   We explicitly write `"CardName: "`, `"CardCost: "`, etc., so that our pattern finds each field reliably.
3.  **Handling Variable-Length Fields**:
    -   The last field `CardEffect` uses `(.*)` without `?` to capture the **remaining** text.
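To make the breakdown concrete, here is a minimal sketch that applies the pattern to a single line. The sample line is the first card from `magic.txt`; the variable names (`sample`, `card_type`, and so on) are only illustrative.

```python
import re

# The pattern described in the breakdown above.
pattern = r'CardName: (.*?) CardCost: (.*?) CardType: (.*?) CardEffect: (.*)'

# First line of magic.txt, split across two string literals for readability.
sample = ("CardName: Absorb CardCost: {W}{U}{U} CardType: Instant "
          "CardEffect: Counter target spell. You gain 3 life.")

match = re.match(pattern, sample)

# groups() returns the four captured fields in the order they appear.
name, cost, card_type, effect = match.groups()

print(name)       # Absorb
print(cost)       # {W}{U}{U}
print(card_type)  # Instant
print(effect)     # Counter target spell. You gain 3 life.
```

Each of the first three groups ends where the next label begins, while the last group runs to the end of the line.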

In [1]:
# Task 1: Load the file into Python
file_path = "magic.txt"  # Replace with the actual path if needed

# Read the file content
with open(file_path, "r", encoding="utf-8") as file:
    data = file.read()

# Print the first few characters to check if the file is loaded correctly
print(data[:1000])  # Display only the first 1000 characters

CardName: Absorb CardCost: {W}{U}{U} CardType: Instant CardEffect: Counter target spell. You gain 3 life.
CardName: Acclaimed Contender CardCost: {2}{W} CardType: Creature — Human Knight CardEffect: When Acclaimed Contender enters the battlefield, if you control another Knight, look at the top five cards of your library. You may reveal a Knight, Aura, Equipment, or legendary artifact card from among them and put it into your hand. Put the rest on the bottom of your library in a random order.
CardName: Act of Treason CardCost: {2}{R} CardType: Sorcery CardEffect: Gain control of target creature until end of turn. Untap that creature. It gains haste until end of turn. (It can attack and {T} this turn.)
CardName: Aerial Assault CardCost: {2}{W} CardType: Sorcery CardEffect: Destroy target tapped creature. You gain 1 life for each creature you control with flying.
CardName: Aeromunculus CardCost: {1}{G}{U} CardType: Creature — Homunculus Mutant CardEffect: Flying {2}{G}{U}: Adapt 1. (If th

In [2]:
# # Define the regex pattern
# pattern = r'CardName: (.*?) CardCost: (.*?) CardType: (.*?) CardEffect: (.*)'

# # Set the file path
# file_path = "magic.txt" 

# # Initialize a list to store extracted data
# cards = []

# with open(file_path, "r", encoding="utf-8") as file:
#     for line in file:
#         match = re.match(pattern, line.strip())  # Remove leading/trailing whitespace
#         if match:
#             card_name, card_cost, card_type, card_effect = match.groups()
#             cards.append({
#                 "CardName": card_name,
#                 "CardCost": card_cost,
#                 "CardType": card_type,
#                 "CardEffect": card_effect
#             })
    
# # Create DataFrame
# df = pd.DataFrame(cards, columns=["CardName", "CardCost", "CardType", "CardEffect"])

# # Display DataFrame
# print(df.head())


### Task 2

Each line in the document describes one card. Split the text on the newline separator (`"\n"`) so that each card can be inspected individually. The result should be a list of strings (or a character vector).

In [3]:
# Task 2: Split data into lines
lines = data.split("\n")  # Split the text using newline character

# Print the first few lines to verify
print(lines[:5])  # Show first 5 lines


['CardName: Absorb CardCost: {W}{U}{U} CardType: Instant CardEffect: Counter target spell. You gain 3 life.', 'CardName: Acclaimed Contender CardCost: {2}{W} CardType: Creature — Human Knight CardEffect: When Acclaimed Contender enters the battlefield, if you control another Knight, look at the top five cards of your library. You may reveal a Knight, Aura, Equipment, or legendary artifact card from among them and put it into your hand. Put the rest on the bottom of your library in a random order.', 'CardName: Act of Treason CardCost: {2}{R} CardType: Sorcery CardEffect: Gain control of target creature until end of turn. Untap that creature. It gains haste until end of turn. (It can attack and {T} this turn.)', 'CardName: Aerial Assault CardCost: {2}{W} CardType: Sorcery CardEffect: Destroy target tapped creature. You gain 1 life for each creature you control with flying.', 'CardName: Aeromunculus CardCost: {1}{G}{U} CardType: Creature — Homunculus Mutant CardEffect: Flying {2}{G}{U}: A
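One detail worth checking when splitting on `"\n"`: if the text ends with a newline, `split("\n")` returns a trailing empty string. Whether `magic.txt` ends with a newline is not shown here, so the sketch below uses a hypothetical two-line sample (`sample`, `parts`, and `lines_clean` are illustrative names).

```python
# Hypothetical two-card text ending in a newline, as many text files do.
sample = "CardName: A ...\nCardName: B ...\n"

# split("\n") keeps an empty string after the final newline.
parts = sample.split("\n")
print(parts)                # ['CardName: A ...', 'CardName: B ...', '']

# splitlines() does not produce that trailing empty entry.
print(sample.splitlines())  # ['CardName: A ...', 'CardName: B ...']

# Filtering out blank entries makes the length equal the number of cards.
lines_clean = [line for line in parts if line.strip()]
print(len(lines_clean))     # 2
```

In this notebook a stray empty string would not end up in the data frame, because the regex in Task 3 simply fails to match it, but it would affect a direct `len(lines)` check against the expected number of cards.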

### Task 3

The information about each card is given in the following format: `CardName: [...] CardCost: [...] CardType: [...] CardEffect: [...]`. Exploit this format to extract and save each piece of information separately. Turn the information you collected into a coherent data frame with the columns “Name”, “Cost”, “Type” and “Effect”.

In [4]:
import pandas as pd
import re

# Task 3: Define regex pattern for extracting card details
pattern = r'CardName: (.*?) CardCost: (.*?) CardType: (.*?) CardEffect: (.*)'

# Initialize an empty list to store structured data
cards = []

# Extract information using regex
for line in lines:
    match = re.match(pattern, line.strip())  # Strip leading/trailing whitespace
    if match:
        card_name, card_cost, card_type, card_effect = match.groups()
        cards.append({"Name": card_name, "Cost": card_cost, "Type": card_type, "Effect": card_effect})

# Convert list of dictionaries into a DataFrame
df = pd.DataFrame(cards)

# Display the first few rows
df.head()

|   | Name | Cost | Type | Effect |
| --- | --- | --- | --- | --- |
| 0 | Absorb | {W}{U}{U} | Instant | Counter target spell. You gain 3 life. |
| 1 | Acclaimed Contender | {2}{W} | Creature — Human Knight | When Acclaimed Contender enters the battlefiel... |
| 2 | Act of Treason | {2}{R} | Sorcery | Gain control of target creature until end of t... |
| 3 | Aerial Assault | {2}{W} | Sorcery | Destroy target tapped creature. You gain 1 lif... |
| 4 | Aeromunculus | {1}{G}{U} | Creature — Homunculus Mutant | Flying {2}{G}{U}: Adapt 1. (If this creature h... |
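The extraction loop silently skips any line that does not match the pattern. As a possible variation (not the notebook's own code), the sketch below uses named groups and keeps a list of unmatched lines so they can be inspected. The names `named_pattern`, `sample_lines`, `records`, and `unmatched` are hypothetical, and the sample data is made up apart from the first card.

```python
import re

# Assumed variation on the Task 3 pattern: named groups give the dict keys directly.
named_pattern = re.compile(
    r'CardName: (?P<Name>.*?) CardCost: (?P<Cost>.*?) '
    r'CardType: (?P<Type>.*?) CardEffect: (?P<Effect>.*)'
)

# Hypothetical input: one valid line, one malformed line, one blank line.
sample_lines = [
    "CardName: Absorb CardCost: {W}{U}{U} CardType: Instant "
    "CardEffect: Counter target spell. You gain 3 life.",
    "CardName: Broken line without the other labels",
    "",
]

records, unmatched = [], []
for number, line in enumerate(sample_lines):
    line = line.strip()
    if not line:
        continue  # blank lines carry no card, so skip them
    m = named_pattern.match(line)
    if m:
        records.append(m.groupdict())      # {'Name': ..., 'Cost': ..., ...}
    else:
        unmatched.append((number, line))   # keep for later inspection

print(records[0]["Name"])  # Absorb
print(len(unmatched))      # 1
```

The `records` list has the same shape as `cards` above, so it could be passed to `pd.DataFrame` in the same way. Comparing `len(records)` with the 1419 cards mentioned in Task 1 would then be a quick sanity check.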


### Task 4

Which of the words “Creature”, “Sorcery”, “Instant”, “Enchantment” and “Artifact” appears most often in all the texts within the “Type” column?

In [5]:
# Task 4: Count occurrences of specific words in the "Type" column
word_list = ["Creature", "Sorcery", "Instant", "Enchantment", "Artifact"]
word_counts = {word: 0 for word in word_list}  # Initialize a dictionary for counting

# Count occurrences in the "Type" column
for card_type in df["Type"]:
    for word in word_list:
        if word in card_type:
            word_counts[word] += 1

# Find the most frequently occurring word
most_common_word = max(word_counts, key=word_counts.get)

# Display the results
for word, count in word_counts.items():
    print(f"{word}: {count} occurrences")
print("Most Common Word:", most_common_word)


Creature: 733 occurrences
Sorcery: 185 occurrences
Instant: 215 occurrences
Enchantment: 85 occurrences
Artifact: 108 occurrences
Most Common Word: Creature
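Note that the loop above counts, for each word, the number of cards whose type string contains it. A card with more than one type therefore contributes to several counts. The sketch below illustrates this on a few hypothetical type strings written in the style of the "Type" column (`sample_types` is not taken from the data set).

```python
# Hypothetical type strings in the same style as the "Type" column.
sample_types = [
    "Instant",
    "Creature — Human Knight",
    "Artifact Creature — Golem",
    "Sorcery",
]

word_list = ["Creature", "Sorcery", "Instant", "Enchantment", "Artifact"]

# Same substring test as the notebook, written as a dict comprehension.
word_counts = {
    word: sum(word in card_type for card_type in sample_types)
    for word in word_list
}

print(word_counts)
# {'Creature': 2, 'Sorcery': 1, 'Instant': 1, 'Enchantment': 0, 'Artifact': 1}

# The "Artifact Creature" card is counted twice, so the total exceeds the card count.
print(sum(word_counts.values()), len(sample_types))  # 5 4
```

For the same reason, the five counts need not add up to the number of rows in `df`, and cards with none of the five types are not counted at all.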
