<a target="_blank" href="https://colab.research.google.com/github/roitraining/Rackspace-Python/blob/main/Rackspace_Python.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

* All of us have repetitive tasks in our work day.
  * Some of these can be automated. 
  * We all understand how Excel macros automate tasks in a spreadsheet.
* Python is an ideal choice because:
    * It's one of the easiest languages to learn
    * It's flexible enough to be used in a lot of different situations
    * It's free, open source, and cross-platform, with an extensive library of community-created add-on modules known as packages


### Let's use <a href="https://www.python.org">Python</a> for automating tasks

### Some common tasks Python can be used to automate:
* Data processing, transformation (ETL), engineering and analysis
* Big Data processing
* Machine Learning and AI
* OS and administrative installation and maintenance routines
* Web Scraping
* Internet of Things 
* Testing
* Mocking
* Many more ...



##### Let's see a simple example of how to read the contents of a file

In [2]:
with open("regions.txt") as i:
    data = i.read()
    print(data)

RegionID,RegionName
1,North
2,South
3,East
4,West
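
Reading the whole file as one string is fine for printing it, but most automation needs the individual fields. The sketch below is one possible next step using the standard library `csv` module. It parses the same text shown in the output above from an in-memory string so it runs without the file; the function name `parse_regions` is made up for this illustration.

```python
import csv
import io

# Sample text copied from the regions.txt output shown above.
SAMPLE = """RegionID,RegionName
1,North
2,South
3,East
4,West
"""

def parse_regions(text):
    """Return a dict mapping each integer RegionID to its RegionName."""
    # DictReader uses the first line as column names for every later row.
    reader = csv.DictReader(io.StringIO(text))
    return {int(row["RegionID"]): row["RegionName"] for row in reader}

regions = parse_regions(SAMPLE)
print(regions[3])
```

To parse the real file instead, the same `csv.DictReader` call can be given the file object returned by `open("regions.txt", newline="")`.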



##### Let's improve on that a little to make it transform that data by uppercasing the regions and write it to a new file.

In [3]:
with open("regions.txt") as i:
    data = i.read()
    with open("upper_regions.txt", "w") as o:
        o.write(data.upper())

# Let's just see what it looks like
with open("upper_regions.txt") as o:
    print(o.read())


REGIONID,REGIONNAME
1,NORTH
2,SOUTH
3,EAST
4,WEST
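
Notice that calling `upper()` on the whole file also uppercased the header row. If that is not wanted, a variation along these lines uppercases a single column and leaves the header alone. It is only a sketch: `uppercase_column` is a hypothetical helper, and it works on strings rather than files to keep the example self-contained.

```python
import csv
import io

def uppercase_column(text, column):
    """Return CSV text with one column uppercased and the header unchanged."""
    reader = csv.DictReader(io.StringIO(text))
    out = io.StringIO()
    # Reuse the input's column names so the header is written back as-is.
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        row[column] = row[column].upper()
        writer.writerow(row)
    return out.getvalue()

sample = "RegionID,RegionName\n1,North\n2,South\n"
print(uppercase_column(sample, "RegionName"))
```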



##### We can get much fancier than that using a common data processing module called <a href="https://pandas.pydata.org">Pandas</a>.
###### First we need to install this package and the <font color='blue' face="Courier New" size="+3">pip</font> utility will help us to do that.

In [36]:
! pip install pandas



In [2]:
import pandas as pd

# Read CSV file with headers
df = pd.read_csv("data/territories_headers.csv")

# Keep only the rows whose RegionID column equals 1
filtered_df = df[df["RegionID"] == 1]

display(filtered_df)

Unnamed: 0,TerritoryID,TerritoryName,RegionID
0,1581,Westboro,1
1,1730,Bedford,1
2,1833,Georgetow,1
3,2116,Boston,1
4,2139,Cambridge,1
5,2184,Braintree,1
6,2903,Providence,1
9,6897,Wilton,1
10,7960,Morristown,1
11,8837,Edison,1


* Eventually we might get files that are too big to process on a single machine.
* Fortunately, people hit this problem before us and solved it with <a href="https://spark.apache.org/docs/latest/api/python/index.html">Spark</a>, which offers a Pandas-like DataFrame API through the PySpark package and scales across multiple worker machines in a cluster to handle Big Data workloads of many terabytes or even petabytes.
* The code looks a little different, but the concept is much the same.

In [None]:
! pip install pyspark

In [9]:
import pyspark.sql.functions as F
from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder.appName("CSV Filtering").getOrCreate()

# Read CSV file with headers
df = spark.read.csv("data/territories_headers.csv", header=True, inferSchema=True)

# Filter based on the RegionID field
filtered_df = df.filter(F.col("RegionID") == 1)

filtered_df.toPandas()

Unnamed: 0,TerritoryID,TerritoryName,RegionID
0,1581,Westboro,1
1,1730,Bedford,1
2,1833,Georgetow,1
3,2116,Boston,1
4,2139,Cambridge,1
5,2184,Braintree,1
6,2903,Providence,1
7,6897,Wilton,1
8,7960,Morristown,1
9,8837,Edison,1


* Once we've accumulated multiple terabytes of data and learned how to manipulate it with Big Data tools, the next step is often to use it for Machine Learning (ML).
* ML is all about finding patterns in data to create a model that can predict future values based on past patterns.
* Python is the dominant language in this field, so you don't need to learn a whole new language, just some new packages and libraries.
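
To make "finding patterns in data to predict future values" concrete, here is about the smallest model there is: a straight line fitted by ordinary least squares in plain Python. This is a simpler technique than the classifier used in this notebook, and the numbers are invented purely for illustration.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # The slope is the covariance of x and y divided by the variance of x.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical past observations that happen to follow y = 2x + 1 exactly.
past_x = [1, 2, 3, 4]
past_y = [3, 5, 7, 9]

slope, intercept = fit_line(past_x, past_y)

# Use the learned pattern to predict a value we have not seen yet.
print(slope * 5 + intercept)  # prints 11.0
```

Libraries such as scikit-learn follow the same fit-then-predict shape, with more capable models behind it.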


In [38]:
! pip install scikit-learn



In [31]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# load the CSV data
data = pd.read_csv("data/credit_card_data.csv")

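# One-hot encode the categorical columns. Columns not listed here
# (purchase_amount, time_of_day) are dropped, because ColumnTransformer's
# remainder parameter defaults to 'drop'.
preprocessor = ColumnTransformer(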
    transformers=[
        ('day_of_week', OneHotEncoder(), ['day_of_week']),
        ('store_type', OneHotEncoder(), ['store_type']),
        ('online_or_inperson', OneHotEncoder(), ['online_or_inperson'])
    ]
)
# Create the pipeline
pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                          ('classifier', LogisticRegression())])

# Split into features and labels
X = data.drop('is_valid', axis=1)
y = data['is_valid']

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the pipeline to the training data
model = pipeline.fit(X_train, y_train)

new_rows = pd.DataFrame([{
    "purchase_amount": 500,
    "time_of_day": 15,
    "day_of_week": "Tuesday",
    "store_type": "Grocery",
    "online_or_inperson": "In-Person",
}, {
    "purchase_amount": 190,
    "time_of_day": 9,
    "day_of_week": "Friday",
    "store_type": "Restaurant",
    "online_or_inperson": "Online",
}
])

prediction = model.predict(new_rows)

print(prediction)

[1 0]


* Normally you'd have real data for this, but you can even use Python and another free, community-built library to generate the fake data I used to build this model.
* This is sometimes called mocking, because we create mock data.
* It's yet another automation task Python can handle, and of course there's a popular community package to help us do it, called <a href="https://faker.readthedocs.io/en/master/">Faker</a>.

In [35]:
! pip install faker



In [25]:

import pandas as pd
from faker import Faker

# Create a Faker instance
fake = Faker()

# Generate 1000 rows of sample data
data = []
for _ in range(1000):
    data.append({
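        'purchase_amount': fake.random_int(min=1, max=1000),
        'day_of_week': fake.day_of_week(),
        'time_of_day': fake.random_int(0, 23),  # hour of the day, 0-23
        # store_type is a column the model above expects; only the two
        # values that appear in the prediction example are used here
        'store_type': fake.random_element(elements=['Grocery', 'Restaurant']),
        'online_or_inperson': fake.random_element(elements=['Online', 'In-Person']),
        'is_valid': fake.random_element(elements=[0, 1])  # 0 for fraudulent, 1 for valid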
    })

# Create a Pandas DataFrame
df = pd.DataFrame(data)

# Save the DataFrame to a CSV file
df.to_csv('data/credit_card_data2.csv', index=False)
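
Mock data is easier to work with when it is reproducible, so that a rerun produces the same rows. The sketch below shows that idea using only the standard library `random` module in place of Faker. The function name `make_mock_purchases` and the default seed are assumptions made for this example, and the columns simply mirror the ones used in this notebook.

```python
import random

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]

def make_mock_purchases(n, seed=0):
    """Return n mock purchase rows; the same seed always yields the same rows."""
    rng = random.Random(seed)  # private generator, so global state is untouched
    rows = []
    for _ in range(n):
        rows.append({
            "purchase_amount": rng.randint(1, 1000),
            "day_of_week": rng.choice(DAYS),
            "time_of_day": rng.randint(0, 23),
            "online_or_inperson": rng.choice(["Online", "In-Person"]),
            "is_valid": rng.choice([0, 1]),  # 0 for fraudulent, 1 for valid
        })
    return rows

rows = make_mock_purchases(5)
print(len(rows), sorted(rows[0]))
```

Because the labels here are random, a model trained on this data has no real pattern to learn; it is useful for exercising the code path, not for judging accuracy.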

* Let's look at automating OS admin tasks.
* If you're using Windows, you might use PowerShell scripts to do this.
* For Linux, you might use bash shell scripts.
* Python is a better fit when the tasks are a little more complex.
* You can also use the same scripts on any OS.
* In this example, let's say we have a folder full of CSV files and we want to find all the ones whose names end with _archive.csv and move them to another folder.

In [None]:
import os
import shutil
import sys

def move_archive_files(source_folder, destination_folder):
  """Moves all CSV files ending with '_archive' from the source folder to the destination folder."""

  for file_name in os.listdir(source_folder):
    if file_name.endswith('_archive.csv'):
      source_file = os.path.join(source_folder, file_name)
      destination_file = os.path.join(destination_folder, file_name)
      shutil.move(source_file, destination_file)
      print(f"Moved {file_name} to {destination_folder}")

if __name__ == "__main__":
  if len(sys.argv) != 3:
    print("Usage: python script.py <source_folder> <destination_folder>")
    sys.exit(1)

  source_folder = sys.argv[1]
  destination_folder = sys.argv[2]

  move_archive_files(source_folder, destination_folder)
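
Scripts that move files are worth rehearsing before they touch real data. The variation below is a sketch of one way to do that, using `pathlib` and a `dry_run` flag. The flag and the demo folder names are assumptions added for illustration and are not part of the script above. The demo runs inside a temporary directory, so nothing on disk is affected.

```python
import shutil
import tempfile
from pathlib import Path

def move_archive_files(source_folder, destination_folder, dry_run=False):
    """Move files ending in '_archive.csv'; return the names that matched.

    With dry_run=True nothing is moved, so the list can be checked first.
    """
    source = Path(source_folder)
    destination = Path(destination_folder)
    if not dry_run:
        destination.mkdir(parents=True, exist_ok=True)
    moved = []
    # sorted() builds the full list before any file is moved.
    for path in sorted(source.glob("*_archive.csv")):
        if not dry_run:
            shutil.move(str(path), str(destination / path.name))
        moved.append(path.name)
    return moved

# Demo in a throwaway directory with hypothetical file names.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "incoming"
    dst = Path(tmp) / "archive"
    src.mkdir()
    for name in ["jan_archive.csv", "feb_archive.csv", "current.csv"]:
        (src / name).write_text("placeholder\n")

    print(move_archive_files(src, dst, dry_run=True))  # rehearsal only
    print(move_archive_files(src, dst))                # real move
    print(sorted(p.name for p in src.iterdir()))       # what was left behind
```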

* Sometimes a website has information on it that we'd like to collect automatically.
* Often the website owner makes that easy by offering a web service.
* When they don't, we can use a process known as web scraping to find that content in the web page and extract it.
* There are many packages that can help with this, but one of the most popular is <a href="https://pypi.org/project/beautifulsoup4/">Beautiful Soup</a>.


In [39]:
! pip install BeautifulSoup4



In [34]:
import requests
from bs4 import BeautifulSoup

page = requests.get('https://www.x-rates.com/calculator/?from=GBP&to=USD&amount=1')
soup = BeautifulSoup(page.text, 'html.parser')

part1 = soup.find(class_="ccOutputRslt").get_text()
print(part1)


1.306905 USD
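
To see roughly what "find the element with this class and return its text" involves, here is a minimal offline sketch built on the standard library `html.parser` module instead of Beautiful Soup. The HTML snippet is invented for the example and is not the real markup of the page above; only the class name `ccOutputRslt` comes from the code in this notebook.

```python
from html.parser import HTMLParser

class ClassTextExtractor(HTMLParser):
    """Collect the text inside the first element that has a given CSS class.

    Sketch only: void tags without a closing slash (such as <br>) inside
    the matched element are not handled.
    """

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.depth = 0      # greater than 0 while inside the matching element
        self.parts = []
        self.done = False   # True once the first match has been closed

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth:
            self.depth += 1  # a nested tag inside the match
        else:
            classes = (dict(attrs).get("class") or "").split()
            if self.wanted_class in classes:
                self.depth = 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)

    def text(self):
        return "".join(self.parts).strip()

# Hypothetical markup, made up for this example.
SAMPLE_HTML = '<p class="label">1 GBP =</p><p class="ccOutputRslt">1.306905 <b>USD</b></p>'

extractor = ClassTextExtractor("ccOutputRslt")
extractor.feed(SAMPLE_HTML)
print(extractor.text())  # prints 1.306905 USD
```

Beautiful Soup does this kind of bookkeeping for you and copes with messy real-world HTML, which is why the code above needs only a single `find` call.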


* There are so many of these open source packages that they can do almost anything you might imagine.
* There's a whole site that serves as a repository for them, where you can easily search to see whether some nice person has already solved your problem.
* <a href="http://pypi.org">PyPI</a>
