### Overview of Python and Its Role in Big Data

Python is a high-level, open-source programming language widely used in Big Data analytics due to its simplicity, flexibility, and rich ecosystem of libraries.

#### Why Python is Popular in Big Data

1. Ease of Use
     Simple syntax and readability make Python easy to learn and use.
     Reduces development time compared to more verbose, statically typed languages such as Java or C++.

2. Extensive Library Ecosystem
    Python offers powerful libraries for handling large-scale data:

    NumPy – numerical computing

    Pandas – data manipulation and analysis

    PySpark – Python API for Apache Spark

    Dask – parallel computing for large datasets

    SciPy – scientific and statistical analysis

3. Integration with Big Data Frameworks

    Connects to Hadoop, Spark, Hive, and HDFS through Python APIs and client libraries.

    PySpark allows distributed data processing across clusters.

4. Support for Data Analytics and Machine Learning

    Widely used in data mining, predictive analytics, and AI.

    Libraries such as Scikit-learn, TensorFlow, and PyTorch support large-scale model development.

5. Scalability and Performance

    Although Python itself is not the fastest language, it scales efficiently by leveraging distributed systems and optimized backend engines written in C/C++.

6. Strong Community and Industry Adoption

    Backed by a large global community.

    Used by companies like Google, Netflix, Amazon, and Facebook for big data solutions.
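The point in item 5 about optimized backends can be illustrated without any third-party library. The sketch below assumes CPython, where the built-in `sum` is implemented in C; the helper name `python_loop_sum` is made up for this illustration. Libraries such as NumPy apply the same idea to whole arrays.

```python
import timeit

def python_loop_sum(values):
    """Add the values one at a time in interpreted Python code."""
    total = 0
    for v in values:
        total += v
    return total

data = list(range(100_000))

# Both approaches give the same answer.
loop_result = python_loop_sum(data)
builtin_result = sum(data)
print(loop_result == builtin_result)

# Timings depend on the machine, so no expected numbers are shown.
loop_time = timeit.timeit(lambda: python_loop_sum(data), number=20)
builtin_time = timeit.timeit(lambda: sum(data), number=20)
print(f"loop: {loop_time:.4f}s, built-in sum: {builtin_time:.4f}s")
```

On most machines the built-in version is noticeably quicker, because the loop runs inside compiled code rather than in the interpreter.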

### Role of Python in the Big Data Lifecycle

1. Data Ingestion: Reading data from multiple sources (databases, APIs, logs).

2. Data Processing: Cleaning, transforming, and aggregating massive datasets.

3. Data Analysis: Statistical analysis and pattern discovery.

4. Data Visualization: Tools like Matplotlib and Seaborn for insights.

5. Machine Learning: Building scalable predictive models.
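The first three lifecycle stages can be sketched on a tiny scale with the standard library alone. The CSV text and column names below are made up for illustration; a real pipeline would read from files, databases, or APIs and would likely use Pandas or PySpark instead.

```python
import csv
import io
import statistics

# Hypothetical raw data; note the missing temperature in the second row.
raw_csv = """city,temperature
Paris,21.5
Berlin,
Paris,23.5
Berlin,18.0
"""

# 1. Ingestion: parse the CSV text into one dictionary per row.
rows = list(csv.DictReader(io.StringIO(raw_csv)))

# 2. Processing: drop rows with a missing temperature and convert to float.
clean = [
    {"city": r["city"], "temperature": float(r["temperature"])}
    for r in rows
    if r["temperature"]
]

# 3. Analysis: average temperature per city.
by_city = {}
for r in clean:
    by_city.setdefault(r["city"], []).append(r["temperature"])
averages = {city: statistics.mean(temps) for city, temps in by_city.items()}

print(averages)  # {'Paris': 22.5, 'Berlin': 18.0}
```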

#### Setting Up Python Environment
🔹 Anaconda (Recommended for Data Science)

Anaconda is a Python distribution that comes with Python, Jupyter Notebook, and popular data science libraries preinstalled.

Download Anaconda:
👉 https://www.anaconda.com/download

Includes: Python, NumPy, Pandas, Matplotlib, Scikit-learn, Jupyter

Useful for Big Data, ML, and analytics

Environment management using conda

🔹 Local IDE Options

Jupyter Notebook (included in Anaconda)
👉 https://jupyter.org/


#### Cloud-Based Python Options (No Installation)
✅ Google Colab

Free cloud-based Python environment

Preinstalled data science & ML libraries

Supports GPU/TPU

👉 https://colab.research.google.com/

### Data Types

Data types define the kind of data a variable can store in Python.

Common built-in data types:

int – Integer values (e.g., 10, -5)

float – Decimal numbers (e.g., 3.14)

complex – Complex numbers (e.g., 2+3j)

str – Text or strings (e.g., "Python")

bool – Boolean values (True, False)

list – Ordered, mutable collection (e.g., [1, 2, 3])

tuple – Ordered, immutable collection (e.g., (1, 2, 3))

set – Unordered, unique elements (e.g., {1, 2, 3})

dict – Key-value pairs (e.g., {"id": 1, "name": "AI"})

In [1]:
#integer
x = 10
y = -5
print(type(x))

<class 'int'>


In [2]:
#float
pi = 3.14
print(type(pi))

<class 'float'>


In [3]:
#complex
z = 2 + 3j
print(type(z))


<class 'complex'>


In [4]:
#string
name = "Python"
print(name.upper())

PYTHON


In [5]:
#boolean
is_valid = True
print(type(is_valid))


<class 'bool'>
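The cells above cover the scalar types. As a small sketch, the collection types from the same list can be inspected the same way; the variable names here are illustrative only.

```python
# One sample value for each collection type listed earlier
samples = [
    [1, 2, 3],                 # list
    (1, 2, 3),                 # tuple
    {1, 2, 3},                 # set
    {"id": 1, "name": "AI"},   # dict
]
type_names = [type(value).__name__ for value in samples]
print(type_names)               # ['list', 'tuple', 'set', 'dict']

# A set silently drops duplicate values
unique_ids = set([3, 1, 3, 2, 1])
print(len(unique_ids))          # 3
print(unique_ids == {1, 2, 3})  # True
```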


##### Variables

Variables are used to store data values in memory.

No need to declare data type explicitly

Type is assigned dynamically

In [6]:
x = 10
name = "Python"
is_active = True
print(name)

Python
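Dynamic typing means the type belongs to the value, not to the variable name. A minimal sketch, rebinding one name (chosen here only for illustration) to values of different types:

```python
value = 10
first_type = type(value).__name__    # 'int'

value = "ten"                        # same name, now bound to a str
second_type = type(value).__name__   # 'str'

value = [1, 0]                       # and now to a list
third_type = type(value).__name__    # 'list'

print(first_type, second_type, third_type)  # int str list
```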


##### Operators

Operators are used to perform operations on variables and values.

Types of Operators:

Arithmetic Operators
+ - * / % ** //

Relational (Comparison) Operators
== != > < >= <=

Assignment Operators
= += -= *= /=

Logical Operators
and or not

Bitwise Operators
& | ^ ~ << >>

Membership Operators
in not in

Identity Operators
is is not

In [7]:
#arithmetic
a = 10
b = 3
print(a + b, a * b, a%b)


13 30 1


In [8]:
#comparison
print(a > b)
print(a == b)

True
False


In [9]:
#assignment operators
a += 5
print(a)

15


In [10]:
# logical operators
x = True
y = False
print(x and y)


False
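The cells above exercise arithmetic, comparison, assignment, and logical operators. The remaining categories from the list can be sketched as follows; the values are redefined so the block runs on its own.

```python
a = 10
b = 3

# Division variants and exponentiation
print(a / b)     # true division returns a float
print(a // b)    # 3     (floor division)
print(a ** b)    # 1000  (exponentiation)

# Bitwise operators act on the binary form: 10 is 0b1010, 3 is 0b0011
print(a & b)     # 2   (0b0010)
print(a | b)     # 11  (0b1011)
print(a ^ b)     # 9   (0b1001)
print(a << 1)    # 20
print(a >> 1)    # 5
print(~a)        # -11

# Membership
langs = ["Python", "Scala"]
print("Python" in langs)      # True
print("Java" not in langs)    # True

# Identity (is) compares objects; equality (==) compares contents
first = [1, 2]
second = [1, 2]
alias = first
print(first == second)   # True
print(first is second)   # False
print(first is alias)    # True
```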


### Homework - Due Next Day Before Lecture

Write a Python program that does the following:

Store the following information using appropriate data types:

Name

Age

Height

Is the student enrolled? (True/False)

Print each value along with its data type.

### List
A list is a built-in data structure used to store a collection of items in a single variable. Lists are ordered, mutable (changeable), and allow duplicate values.
Lists are defined by placing elements inside square brackets [], separated by commas.

In [11]:
# Empty list
empty_list = []
# empty_list = list()   # equivalent alternative

# List with values
numbers = [1, 2, 3, 4, 5]
mixed = [1, "hello", 3.14, True]
nested = [[1, 2], [3, 4], [5, 6]]

# List from range
numbers = list(range(10))  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

# List from string
chars = list("hello")  # ['h', 'e', 'l', 'l', 'o']

# List comprehension
squares = [x**2 for x in range(10)]
evens = [x for x in range(20) if x % 2 == 0]

In [12]:
#accessing a list
fruits = ["apple", "banana", "cherry", "date", "elderberry"]

# Index (0-based)
print(fruits[0])      # "apple"
print(fruits[2])      # "cherry"
print(fruits[-1])     # "elderberry" (last item)
print(fruits[-2])     # "date" (second from end)

# Slicing [start:end:step]
print(fruits[1:3])    # ["banana", "cherry"]
print(fruits[:3])     # ["apple", "banana", "cherry"]
print(fruits[2:])     # ["cherry", "date", "elderberry"]
print(fruits[::2])    # ["apple", "cherry", "elderberry"] (every 2nd)
print(fruits[::-1])   # Reverse list

# Length
print(len(fruits))    # 5

apple
cherry
elderberry
date
['banana', 'cherry']
['apple', 'banana', 'cherry']
['cherry', 'date', 'elderberry']
['apple', 'cherry', 'elderberry']
['elderberry', 'date', 'cherry', 'banana', 'apple']
5


In [15]:
# Changing elements
fruits[0] = "apricot"
fruits[1:3] = ["blueberry", "coconut"]

# Append (add to end)
fruits.append("fig")

# Insert at position
fruits.insert(0, "avocado")  # Insert at index 0

# Extend (add multiple items)
fruits.extend(["grape", "kiwi"])
fruits += ["lemon", "mango"]

# Remove elements
# fruits.remove("apple")      # Remove first occurrence; commented out because
#                             # "apple" was replaced above (would raise ValueError)
popped = fruits.pop()         # Remove and return last item
popped = fruits.pop(0)        # Remove and return item at index
del fruits[0]                 # Delete by index
del fruits[1:3]               # Delete slice
fruits.clear()                # Remove all items

##### Key Characteristics
1. Ordered: Items have a defined order that will not change unless explicitly modified.

2. Mutable: You can change, add, and remove items after the list has been created.

3. Diverse Data Types: A single list can contain different types (integers, strings, booleans, or even other lists).
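One practical consequence of mutability is that assigning a list to a second name does not copy it. The sketch below, with made-up scores, contrasts aliasing with copying and in-place sorting with `sorted()`.

```python
scores = [72, 95, 88]

# Assignment does not copy: both names refer to the same list object
alias = scores
alias.append(60)
print(scores)            # [72, 95, 88, 60]

# copy() creates an independent (shallow) copy
snapshot = scores.copy()
snapshot.append(100)
print(scores)            # [72, 95, 88, 60]
print(snapshot)          # [72, 95, 88, 60, 100]

# sorted() returns a new list; sort() reorders the list in place
ordered = sorted(scores)
print(ordered)           # [60, 72, 88, 95]
scores.sort(reverse=True)
print(scores)            # [95, 88, 72, 60]
```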

### Dictionary
A dictionary is a built-in data structure used to store data in key-value pairs. Unlike lists, which use indexed positions, dictionaries use unique keys to retrieve specific values.
Definition: Collection of key-value pairs in which keys must be unique (insertion order is preserved as of Python 3.7)
Dictionaries are defined using curly braces {} with keys and values separated by colons.

In [16]:
# Empty dictionary
empty_dict = {}
empty_dict = dict()

# Dictionary with values
student = {
    "name": "Alice",
    "age": 22,
    "grade": "A",
    "courses": ["Math", "CS", "Physics"]
}

# From list of tuples
pairs = [("a", 1), ("b", 2), ("c", 3)]
d = dict(pairs)

# Using dict comprehension
squares = {x: x**2 for x in range(5)}  # {0:0, 1:1, 2:4, 3:9, 4:16}

# From two lists
keys = ["name", "age", "city"]
values = ["Bob", 25, "NYC"]
person = dict(zip(keys, values))

In [17]:
student = {"name": "Alice", "age": 22, "grade": "A"}

# Access by key
print(student["name"])        # "Alice"
print(student.get("age"))     # 22

# get() with default
print(student.get("gpa", 0.0))  # Returns 0.0 if key doesn't exist

# Check if key exists
if "name" in student:
    print(student["name"])

# Get all keys, values, items
print(student.keys())         # dict_keys(['name', 'age', 'grade'])
print(student.values())       # dict_values(['Alice', 22, 'A'])
print(student.items())        # dict_items([('name', 'Alice'), ...])

Alice
22
0.0
Alice
dict_keys(['name', 'age', 'grade'])
dict_values(['Alice', 22, 'A'])
dict_items([('name', 'Alice'), ('age', 22), ('grade', 'A')])
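A common use of `get()` with a default and of `items()` is counting. The following is a minimal sketch with an invented word list:

```python
words = ["spark", "hadoop", "spark", "hive", "spark", "hive"]

# Count occurrences: get() supplies 0 the first time a word is seen
counts = {}
for word in words:
    counts[word] = counts.get(word, 0) + 1

# items() yields (key, value) pairs, which unpack neatly in a for loop
for word, n in counts.items():
    print(word, n)

# The key with the largest count
most_common = max(counts, key=counts.get)
print(most_common)       # spark
```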


In [18]:
student = {"name": "Alice", "age": 22}

# Add or update
student["grade"] = "A"        # Add new key-value
student["age"] = 23           # Update existing value

# Update multiple items
student.update({"gpa": 3.8, "major": "CS"})

# Remove items
removed = student.pop("age")  # Remove and return value
student.pop("gpa", None)      # With default if key doesn't exist
del student["grade"]          # Delete by key
student.clear()               # Remove all items

# setdefault - add if key doesn't exist
student.setdefault("courses", []).append("Math")

##### Key Characteristics
1. Ordered: As of Python 3.7+, dictionaries maintain the order in which items are inserted.

2. Mutable: You can change, add, or remove items after the dictionary has been created.

3. Unique Keys: Every key must be unique; duplicate keys are not allowed (the latest value will overwrite the old one).

4. Keys must be Hashable: Keys are typically immutable values such as strings, numbers, or tuples that contain only hashable items; lists and dictionaries cannot be keys. Values can be of any data type.
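These characteristics can be checked directly. The sketch below uses made-up keys and values; the exact wording of the `TypeError` message varies between Python versions, so it is printed rather than quoted.

```python
# Insertion order is preserved (Python 3.7+)
config = {}
config["host"] = "localhost"
config["port"] = 8080
config["debug"] = False
key_order = list(config)
print(key_order)             # ['host', 'port', 'debug']

# A repeated key keeps only the last value
duplicate = {"a": 1, "a": 2}
print(duplicate)             # {'a': 2}

# Tuples of hashable items can be keys; lists cannot
grid = {(0, 0): "origin"}
error_message = None
try:
    grid[[1, 1]] = "invalid"
except TypeError as exc:
    error_message = str(exc)
    print("TypeError:", error_message)
```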

### Tuples
A tuple is a collection used to store multiple items in a single variable. While they look similar to lists, they have one fundamental difference: they are immutable.
Definition: Ordered, immutable collection of items (can contain duplicates)

In [19]:
# Empty tuple
empty_tuple = ()
empty_tuple = tuple()

# Tuple with values
numbers = (1, 2, 3, 4, 5)
mixed = (1, "hello", 3.14, True)

# Single element tuple (note the comma!)
single = (42,)            # Tuple
not_tuple = (42)          # Just an integer

# Without parentheses (tuple packing)
coordinates = 10, 20, 30  # (10, 20, 30)

# From list
my_list = [1, 2, 3]
my_tuple = tuple(my_list)

# Named tuples (more readable)
from collections import namedtuple
Point = namedtuple('Point', ['x', 'y'])
p = Point(10, 20)
print(p.x, p.y)           # 10 20

10 20


In [20]:
fruits = ("apple", "banana", "cherry", "date")

# Indexing (same as lists)
print(fruits[0])          # "apple"
print(fruits[-1])         # "date"

# Slicing
print(fruits[1:3])        # ("banana", "cherry")
print(fruits[:2])         # ("apple", "banana")
print(fruits[::2])        # ("apple", "cherry")

# Length
print(len(fruits))        # 4

# Membership
if "banana" in fruits:
    print("Found!")

apple
date
('banana', 'cherry')
('apple', 'banana')
('apple', 'cherry')
4
Found!


In [21]:
# Concatenation
tuple1 = (1, 2, 3)
tuple2 = (4, 5, 6)
combined = tuple1 + tuple2  # (1, 2, 3, 4, 5, 6)

# Repetition
repeated = (0,) * 5         # (0, 0, 0, 0, 0)

# Count and index
numbers = (1, 2, 3, 2, 4, 2)
count = numbers.count(2)    # 3
index = numbers.index(3)    # 2

# Unpacking
x, y, z = (10, 20, 30)
print(x, y, z)              # 10 20 30

# Extended unpacking
first, *middle, last = (1, 2, 3, 4, 5)
# first = 1, middle = [2, 3, 4], last = 5

# Swap variables
a, b = 10, 20
a, b = b, a                 # Swap values

10 20 30


In [22]:
# Why we use tuples
# 1. Immutability guarantees data won't change
coordinates = (10.5, 20.3)
# coordinates[0] = 15  # ERROR! Can't modify

# 2. Use as dictionary keys (lists can't be keys)
locations = {
    (0, 0): "Origin",
    (1, 0): "Point A",
    (0, 1): "Point B"
}

# 3. Return multiple values from function
def get_stats(numbers):
    return min(numbers), max(numbers), sum(numbers)

minimum, maximum, total = get_stats([1, 2, 3, 4, 5])

# 4. Smaller memory footprint than lists
import sys
my_list = [1, 2, 3, 4, 5]
my_tuple = (1, 2, 3, 4, 5)
print(sys.getsizeof(my_list))   # Larger
print(sys.getsizeof(my_tuple))  # Smaller

104
80


##### Key Characteristics
1. Ordered: Items have a defined order that will not change.

2. Unchangeable (Immutable): Once a tuple is created, you cannot add, remove, or change its items.

3. Allow Duplicates: Since they are indexed, they can have multiple items with the same value.

4. Lightweight: A tuple uses less memory than a list holding the same items, and creating one is typically a little faster.
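Immutability can be seen in action with a short sketch. The names are illustrative, and the last part shows a caveat worth knowing: the tuple itself cannot change, but a mutable object stored inside it still can.

```python
point = (10.5, 20.3)

# Item assignment is rejected
assignment_failed = False
try:
    point[0] = 15
except TypeError as exc:
    assignment_failed = True
    print("TypeError:", exc)

# "Changing" a tuple means building a new one
moved = (15,) + point[1:]
print(moved)             # (15, 20.3)
print(point)             # (10.5, 20.3)

# Immutability is shallow: a list stored inside a tuple can still change
record = ("Alice", [90, 85])
record[1].append(70)
print(record)            # ('Alice', [90, 85, 70])
```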

![image.png](attachment:image.png)

##### WHEN TO USE WHAT?
###### Use Lists when:

1. You need an ordered collection that can change
2. Order matters and you need to access by index
3. You need to sort, append, or modify frequently

Example: Shopping cart items, game scores

###### Use Dictionaries when:

1. You need fast lookups by key
2. Data is naturally paired (key-value)
3. You need to associate data with unique identifiers

Example: User profiles, configuration settings, caching

###### Use Tuples when:

1. Data should not change (immutable)
2. You need hashable objects (dictionary keys)
3. Returning multiple values from functions
4. Representing fixed records (coordinates, RGB values)

Example: Database records, geographic coordinates
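In practice the three structures are often combined. The sketch below uses invented student records to show one possible arrangement: tuples for fixed records, a list to hold and sort them, and a dictionary for lookup by ID.

```python
# Tuples: fixed records of (student_id, name, score); the data is made up
records = [
    (101, "Alice", 91),
    (102, "Bob", 78),
    (103, "Chen", 85),
]

# List: an ordered collection that can grow and be sorted
records.append((104, "Dana", 88))
ranked = sorted(records, key=lambda rec: rec[2], reverse=True)

# Dictionary: fast lookup by unique identifier
by_id = {student_id: (name, score) for student_id, name, score in records}

print(ranked[0])         # (101, 'Alice', 91)
print(by_id[103])        # ('Chen', 85)
```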