Covers core Python concepts for data analysis and data science.
Overview
Introduction to Python variables and keywords, with practical examples from a Jupyter Notebook (Variables & Keywords.ipynb).
Key Points
Variables
- Temporary storage for values; no need to declare data types in Python.
- Assignment: use = (e.g., a = 10 assigns 10 to variable a).
- Data types (auto-detected):
  - Integer: a = 10 → type(a) returns <class 'int'>.
  - Float: a = 5.5 → type(a) returns <class 'float'>.
  - String: a = "Hello World" or a = 'Hello World' → type(a) returns <class 'str'>.
  - Boolean: a = True → type(a) returns <class 'bool'> (note: a = "True" is a string).
- Rules:
  - Variable names cannot start with numbers (e.g., 5a = 10 → SyntaxError; but a5 = 10 works).
  - Case-sensitive: name = 2 and Name = 4 are different variables.
- Type casting: convert between types.
  - a = 89.9; a = int(a) → 89 (<class 'int'>).
  - a = "5.5"; b = float(a) → 5.5 (<class 'float'>); b = int(float(a)) → 5 (<class 'int'>).
  - a = "hello"; b = int(a) → ValueError (non-numeric strings cannot be converted to int).
  - round(89.5, 0) → 90.0, while int(89.5) → 89 (int() truncates, round() rounds).
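A minimal sketch of these conversions (the variable names are illustrative):

```python
a = 89.9
print(int(a))            # 89  (int() truncates toward zero)

s = "5.5"
print(float(s))          # 5.5
print(int(float(s)))     # 5   (cast to float first, then to int)

try:
    int("hello")         # non-numeric string
except ValueError as err:
    print("ValueError:", err)

print(round(89.5, 0))    # 90.0 (round() returns a float when given ndigits)
print(int(89.5))         # 89
```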
Installation: Download Python from python.org/downloads (e.g., 3.13) or use Anaconda Navigator for bundled tools (Jupyter, Spyder). Alternatively, use Google Colab for no-install coding with pre-installed Python. macOS often ships with Python pre-installed; verify the version or install if needed.
Keywords
Reserved words with special meaning (e.g., def, if, else, continue, break, import, None). Keywords cannot be used as variable names. Note that print and type are built-in functions rather than keywords, so shadowing them (e.g., type = 10) is possible, though not recommended. In Jupyter Notebook, keywords and built-ins such as def, print, and type appear green.
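One quick way to list the reserved words is the standard-library keyword module (a minimal sketch):

```python
import keyword

print(keyword.kwlist)        # all reserved words, e.g. 'def', 'if', 'else', 'import', 'None'
print(len(keyword.kwlist))   # 35 in recent Python 3 releases

# True keywords cannot be reassigned (def = 10 raises SyntaxError).
# Built-ins like type and print are not keywords, so they can be shadowed:
type = 10        # shadows the built-in type(); avoid doing this
del type         # removes the shadow and restores the built-in
```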
Overview
Introduction to Python data types, operators, and operands, with practical examples from a Jupyter Notebook (Datatypes.ipynb).
Key Points
Data Types
Classified into five categories:
- Numeric: integers, floats, complex numbers.
- Sequence types: strings, lists, tuples.
- Dictionaries: key-value pairs.
- Boolean: True/False.
- Sets: unique, unordered collections.
Use type() to check the data type of a variable.
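A short sketch checking each category with type():

```python
print(type(10))                  # <class 'int'>
print(type(5.5))                 # <class 'float'>
print(type(2 + 3j))              # <class 'complex'>
print(type("hello"))             # <class 'str'>
print(type([1, 2, 3]))           # <class 'list'>
print(type((1, 2, 3)))           # <class 'tuple'>
print(type({"key": "value"}))    # <class 'dict'>
print(type(True))                # <class 'bool'>
print(type({1, 2, 3}))           # <class 'set'>
```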
Operators
Symbols used for computations, including:
- + (addition)
- - (subtraction)
- * (multiplication)
- / (division)
- // (floor division)
- % (modulus)
- ** (exponentiation)
Operands
Values that operators act upon in expressions.
Order of Precedence
Follows the PEMDAS rule: Parentheses → Exponentiation → Multiplication/Division → Addition/Subtraction.
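A minimal sketch of operator precedence in action:

```python
print(2 + 3 * 4)        # 14  (multiplication before addition)
print((2 + 3) * 4)      # 20  (parentheses first)
print(2 ** 3 * 2)       # 16  (exponentiation before multiplication)
print(7 // 2, 7 % 2)    # 3 1 (floor division and modulus)
```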
Installation
Install Python via python.org, Anaconda Navigator, or Google Colab (a no-install option for coding).
Input Function
input() function captures user input, always returning data as a string.
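A small sketch (the prompt text is illustrative); note the explicit cast, because input() returns a string:

```python
age_text = input("Enter your age: ")    # e.g. the user types 25
print(type(age_text))                   # <class 'str'>

age = int(age_text)                     # cast before doing arithmetic
print(age + 5)                          # 30
```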
Overview Lists are mutable, ordered sequences of mixed data types, enclosed in square brackets []. They support various operations like indexing, slicing, concatenation, and modification. Tuples, in contrast, are immutable and use parentheses (). This document includes examples from the provided Lists.ipynb and additional code snippets.
Key Points
Lists
- Ordered, mutable, allow mixed data types (integers, floats, strings, nested lists).
- Defined using square brackets [].
- Example: my_list = [1, 2.5, "hello", [3, 4]]
Tuples
- Ordered, immutable, used for fixed data.
- Defined using parentheses () or comma-separated values.
- Example: my_tuple = (1, 2, "world")
- my_tuple = 1, 2, "world"  # parentheses optional
Sets
- Unordered, unique elements.
- Defined using curly braces {}.
- Example: my_set = {1, 2, 3, 3}  # stored as {1, 2, 3}

Indexing
- Zero-based indexing; access elements using list[index].
- Negative indexing starts from the end:

```python
fruits = ["apple", "banana", "cherry"]
print(fruits[0])    # apple
print(fruits[-1])   # cherry
```

Slicing
- Extract sublists using list[start:end].
- Omitting start → from the beginning; omitting end → to the end.

```python
numbers = [0, 1, 2, 3, 4, 5]
print(numbers[1:4])   # [1, 2, 3]
print(numbers[:3])    # [0, 1, 2]
print(numbers[3:])    # [3, 4, 5]
```
Common List Operations:
- Concatenation → list1 + list2 → Combine two lists
- Append → list.append(item) → Add item to end
- Extend → list.extend([items]) → Add multiple items
- Insert → list.insert(index, item) → Insert at specific position
- Remove → list.remove(item) → Remove first occurrence
- Pop → list.pop(index) → Remove and return item
- Sort → list.sort() → Sort in place
- Sorted → sorted(list) → Return sorted copy
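A quick sketch of these operations on a sample list (the values are illustrative):

```python
nums = [3, 1, 2]

combined = nums + [7, 8]       # concatenation -> [3, 1, 2, 7, 8]
nums.append(4)                 # [3, 1, 2, 4]
nums.extend([5, 6])            # [3, 1, 2, 4, 5, 6]
nums.insert(0, 10)             # [10, 3, 1, 2, 4, 5, 6]
nums.remove(10)                # removes the first occurrence of 10
last = nums.pop()              # removes and returns 6
nums.sort()                    # sorts in place -> [1, 2, 3, 4, 5]
print(sorted(nums, reverse=True))  # returns a sorted copy; nums is unchanged
print(nums, combined, last)
```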
Mutability
- Lists: elements can be modified after creation.
  my_list = [1, 2, 3]
  my_list[0] = 99   # OK → [99, 2, 3]
- Tuples & strings: immutable; individual elements cannot be changed.

Membership Testing
Use in to check if an element exists:
  fruits = ["apple", "banana"]
  print("banana" in fruits)   # True
Shallow Copy vs. Reference
Assignment creates a reference:

```python
A = [1, 2, 3]
B = A            # B is a reference to A
A[0] = 99
print(B)         # [99, 2, 3] → changed too!
```

Create an independent copy:

```python
B = A[:]           # slicing
B = A.copy()       # copy() method

import copy
B = copy.copy(A)   # shallow copy via the copy module
```
Tip: Use copy() or slicing [:] to create independent list copies and avoid unintended side effects.
Tuples are ordered, immutable sequences of mixed data types, defined with parentheses () or comma-separated values.
They are similar to lists but immutable, making them suitable for fixed data.
This document summarizes tuple characteristics, differences from lists, and practical examples from the provided code and transcript.
- Tuples: Ordered, immutable, allow mixed data types (integers, floats, strings, nested tuples/lists). Defined with parentheses () or commas.
- Lists: Ordered, mutable, defined with square brackets [].
- Syntax: Tuples use (), lists use [].
- Mutability: Tuples are immutable (cannot change elements), lists are mutable.
- Methods: Tuples expose far fewer methods than lists (essentially count() and index(), versus list methods like append(), sort(), and remove()).
- Use Cases: Use tuples for fixed data (e.g., passport details), lists for dynamic data (e.g., customer info).
- Dictionary Keys: Tuples are hashable (can be dictionary keys), lists are not.
- Indexing: Zero-based (e.g., tuple[0]) or negative (e.g., tuple[-1] for the last element).
- Slicing: Extract sub-tuples (e.g., tuple[0:2]).
- Concatenation: Use + to combine tuples.
- Functions: min(), max(), sum().
- Membership: Use in to check if an element exists in a tuple.
- Elements cannot be modified directly.
- To update, use concatenation or type casting (convert to list, modify, and convert back to tuple).
- Tuples cannot be sorted in place.
- Convert to list, sort, then convert back to tuple.
- Tuples can contain other tuples or lists as elements.
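A minimal sketch of these tuple operations (the variable names and values are illustrative):

```python
person = ("Asha", 28, "IN")              # fixed, passport-style data

print(person[0], person[-1])             # Asha IN  (zero-based and negative indexing)
print(person[0:2])                       # ('Asha', 28)  (slicing)
print(person + ("verified",))            # concatenation returns a new tuple

nums = (4, 1, 3)
print(min(nums), max(nums), sum(nums))   # 1 4 8
print(1 in nums)                         # True (membership)

# "Updating" or sorting a tuple via type casting: convert to list, modify, convert back
as_list = list(nums)
as_list.append(2)
as_list.sort()
nums = tuple(as_list)                    # (1, 2, 3, 4)

# Tuples are hashable, so they can be dictionary keys
locations = {(28.6, 77.2): "Delhi"}
print(locations[(28.6, 77.2)])           # Delhi
```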
Sets are unordered collections of unique elements, defined with curly braces {}.
They are ideal for tasks requiring distinct values, such as finding unique items or performing set operations like union, intersection, and difference.
This document summarizes set characteristics, comparisons with lists and tuples, and practical examples from the provided sets.ipynb and transcript.
- Sets
  - Unordered
  - Mutable (can add or remove elements)
  - Contain only unique elements
  - Defined with curly braces {}
- Lists
  - Ordered
  - Mutable
  - Allow duplicates
  - Defined with square brackets []
- Tuples
  - Ordered
  - Immutable
  - Allow duplicates
  - Defined with parentheses () or commas
- Sets do not allow duplicates — converting a list to a set automatically removes duplicates.
- Sets are unordered, so indexing (e.g., set[0]) is not supported.
- Membership testing (in) is faster in sets because they use hash tables internally.
- Sets support various mathematical operations (see the sketch below):
  - Union (|): combines all unique elements from both sets.
  - Intersection (&): elements common to both sets.
  - Difference (-): elements in one set but not in the other.
  - Symmetric Difference (^): elements present in either set but not both.
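A minimal sketch of these operations on two illustrative sets:

```python
a = {1, 2, 3, 4}
b = {3, 4, 5, 6}

print(a | b)   # union: {1, 2, 3, 4, 5, 6}
print(a & b)   # intersection: {3, 4}
print(a - b)   # difference: {1, 2}
print(a ^ b)   # symmetric difference: {1, 2, 5, 6}

# Converting a list to a set removes duplicates
grades = [80, 90, 80, 70, 90]
unique_grades = set(grades)        # {70, 80, 90} (order not guaranteed)

# Fast membership testing
print(90 in unique_grades)         # True
```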
- Removing duplicates from data collections.
- Checking unique elements in datasets (e.g., unique grades, user IDs).
- Performing set operations for comparisons or filtering common elements.
- Efficient membership testing — e.g., checking if a value exists in a dataset.
| Feature | List | Tuple | Set |
|---|---|---|---|
| Ordered | ✅ Yes | ✅ Yes | ❌ No |
| Mutable | ✅ Yes | ❌ No | ✅ Yes |
| Allows Duplicates | ✅ Yes | ✅ Yes | ❌ No |
| Syntax | [] | () or commas | {} |
| Indexing Supported | ✅ Yes | ✅ Yes | ❌ No |
| Hashable | ❌ No | ✅ Yes | ❌ No (cannot be keys) |
Use:
- Lists for ordered, changeable collections with duplicates.
- Tuples for fixed, immutable collections.
- Sets for unique, unordered data and efficient membership testing.
Note: Sets are not hashable, so they cannot be used as dictionary keys, whereas tuples can.
Loops in Python are used for repetitive tasks, iterating over iterables like lists, tuples, strings, dictionaries, and sets.
This document covers the basics of loops (for and while), conditional statements (if, elif, else), and comprehensions, with practical examples from the provided transcript and loops.ipynb.
- Iterables: Objects that can be iterated over (e.g., lists, tuples, strings, dictionaries, sets).
- Iterator: A variable that traverses each element in an iterable.
- For Loop: Iterates over a sequence or range, executing until the sequence is exhausted.
- While Loop: Executes as long as a condition is true, requiring manual counter updates.
- if, elif, else are used for decision-making based on conditions.
- Comprehensions: Concise alternatives to for loops for creating lists or dictionaries; faster and more compact.
- for loops automatically handle iteration (no manual increment like i++ in other languages).
- Comprehensions reduce code length and improve performance compared to traditional for loops.
- Use .items() for dictionary iteration to access keys and values.
Conditional statements are often used within loops for complex logic.
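A short sketch pulling these pieces together (the names and data are illustrative):

```python
marks = {"Asha": 82, "Ravi": 45, "Meena": 91}

# for loop over a dictionary with .items(), plus if/elif/else inside the loop
for name, score in marks.items():
    if score >= 90:
        grade = "A"
    elif score >= 60:
        grade = "B"
    else:
        grade = "Fail"
    print(name, grade)

# while loop with a manual counter update
i = 0
while i < 3:
    print("iteration", i)
    i += 1

# list and dictionary comprehensions as compact alternatives to for loops
squares = [n ** 2 for n in range(5)]                        # [0, 1, 4, 9, 16]
passed = {name: s for name, s in marks.items() if s >= 60}
print(squares, passed)
```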
Functions in Python are named sequences of statements that perform specific tasks, improving code reusability and modularity.
This document covers user-defined functions, lambda functions, and their applications, with examples from the provided transcript and functions.ipynb.
It also highlights the differences between traditional and lambda functions, emphasizing their use cases and benefits.
- Functions: Named blocks of code that execute specific tasks when called, defined using the def keyword.
- Built-in Functions: Predefined in Python (e.g., type(), len(), int()).
- User-Defined Functions: Created by developers to perform custom tasks (e.g., calculating BMI or checking even/odd).
- Lambda Functions: Small, anonymous functions defined with the lambda keyword, ideal for single-expression tasks.
- Name: Unique identifier for the function (e.g., even_odd).
- Arguments: Inputs passed to the function (optional, can have default values).
- Body: Code block that performs the task.
- Return: Optional output of the function.
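A minimal sketch of these components, using the even/odd example mentioned above (the default value is illustrative):

```python
def even_odd(number=0):            # name + argument with a default value
    """Return whether a number is even or odd."""
    if number % 2 == 0:            # body
        return "even"              # return value
    return "odd"

print(even_odd(10))   # even
print(even_odd(7))    # odd
print(even_odd())     # even (the default value 0 is used)
```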
- Lambda syntax: lambda arguments: expression
- Contain a single expression, no multi-line logic.
- Reduce code complexity and improve performance for simple operations.
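A small sketch comparing a traditional function with its lambda equivalent (the names are illustrative):

```python
# Traditional function
def square(x):
    return x ** 2

# Equivalent lambda: a single expression, no name required
square_lambda = lambda x: x ** 2

print(square(4), square_lambda(4))        # 16 16

# Lambdas are handy inline, e.g. as a sort key
people = [("Asha", 28), ("Ravi", 22)]
people.sort(key=lambda p: p[1])           # sort by age
print(people)                             # [('Ravi', 22), ('Asha', 28)]
```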
- Reusability: Call functions multiple times without rewriting code.
- Modularity: Organize code into manageable, reusable blocks.
- Maintainability: Easier to update and deploy code in production environments.
Default arguments allow functions to use preset values when arguments are not provided.
- Calculating factorials
- Summing natural numbers
- Checking conditions (e.g., even/odd)
- Simplifying repetitive tasks
Map, reduce, and filter functions are functional programming tools in Python that simplify code by reducing the need for explicit loops and branching.
They are efficient alternatives to traditional for loops, offering concise syntax and lower computational overhead.
This document summarizes their definitions, use cases, and examples from the provided transcript and map_reduce_filter.ipynb.
- Map, Reduce, Filter: Built-in functions for processing iterables (e.g., lists, tuples) in a functional programming style.
- Map: Applies a function to each element in an iterable, returning a new collection.
- Filter: Extracts elements from an iterable that satisfy a condition.
- Reduce: Combines elements of an iterable into a single result using pairwise operations.
- Reduce code length and complexity compared to for loops.
- Improve performance (lower time complexity) for certain tasks.
- Enhance readability with concise, expressive syntax.
- Map: map(function, iterable)
- Filter: filter(function, iterable)
- Reduce: reduce(function, iterable) (requires from functools import reduce)
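A minimal sketch of all three, using lambdas for the inline operations (the data is illustrative):

```python
from functools import reduce

numbers = [1, 2, 3, 4, 5]

doubled = list(map(lambda x: x * 2, numbers))        # [2, 4, 6, 8, 10]
evens = list(filter(lambda x: x % 2 == 0, numbers))  # [2, 4]
total = reduce(lambda x, y: x + y, numbers)          # 15 (pairwise addition)

print(doubled, evens, total)
```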
Often used with map, filter, and reduce to define inline operations, reducing the need for named functions.
- Map: Transform data (e.g., convert strings to uppercase, compute areas).
- Filter: Extract elements based on conditions (e.g., values above average, non-null values).
- Reduce: Aggregate data (e.g., multiply or sum all elements).
- reduce was removed from Python 3's built-in namespace but is available in the functools module.
- These functions are less common in data analytics but useful for specific tasks.
- Always convert map/filter results to a list (e.g., list(map(...))) to view the output.
File handling in Python involves operations like creating, reading, writing, updating, and deleting files.
It is crucial for web applications and certain programming tasks, though data science often relies on libraries like pandas for file operations.
This document summarizes file handling methods, modes, and best practices, with examples from the provided transcript and file_handling.ipynb.
- File Handling: The process of managing files (e.g., reading, writing, appending) in Python.
- Open: Access a file with a specified mode (e.g., read, write, append).
- Read: Retrieve content from a file.
- Write: Add or overwrite content in a file.
- Append: Add content to the end of a file.
- Close: End the file session to free resources.
- 'r': Read (default; fails if the file does not exist).
- 'w': Write (creates the file if needed and overwrites any existing content).
- 'a': Append (adds to the end of the file, creating it if needed).
- 'a+': Append and read (allows both appending and reading).
- Always close files after operations to free resources (file.close() or use a with statement).
- Use a with statement for automatic file closure.
- Open files in the correct mode to avoid errors (e.g., you cannot read a file opened in 'w' mode).
- Use \n to add line breaks when writing/appending.
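A small sketch of these operations (the file name is illustrative):

```python
# Write (overwrites the file if it already exists)
with open("notes.txt", "w") as f:
    f.write("first line\n")

# Append to the end of the file
with open("notes.txt", "a") as f:
    f.write("second line\n")

# Read the content back; the with statement closes the file automatically
with open("notes.txt", "r") as f:
    print(f.read())
```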
Libraries like pandas simplify file reading (e.g., CSVs, Excel) compared to traditional file handling.
Control structures in Python guide program flow by analyzing variables and making decisions based on conditions or iterating over data.
This document covers binary and relational operators, decision-making with if-else, iteration with loops, comprehensions, and functional programming tools (map, filter, reduce), with examples from the provided transcript and control_structures.ipynb.
- Control Structures: Blocks that analyze variables and direct program flow based on conditions (e.g., if-else) or iteration (e.g., for, while loops).
- Binary Operators: Operate on two operands (e.g., a + b, where a and b are operands).
- Relational Operators: Compare two values, returning True or False (e.g., ==, !=, >, <, >=, <=).
- Decision Making: Use if, elif, else to execute code based on conditions.

```python
x = 10
if x > 5:
    print("Greater than 5")
elif x == 5:
    print("Equal to 5")
else:
    print("Less than 5")
```
- Traditional programming: Read/write text files, logs, or configurations.
- Data science: Use pandas for structured data (e.g., CSVs) instead of manual file handling.
Object-Oriented Programming (OOPs) is a paradigm that models real-world entities using classes and objects. It focuses on bundling data and methods together, ensuring modularity, reusability, and abstraction.
A class is a blueprint for creating objects. It defines attributes (data) and methods (functions) that describe behavior.
An object is an instance of a class. Each object has its own data but shares the class’s structure and behavior.
Encapsulation bundles data and methods into one unit and restricts direct access to some components, maintaining data integrity.
Inheritance allows one class (child) to acquire the properties and methods of another class (parent), promoting code reuse and hierarchy.
Polymorphism allows different classes to define methods with the same name but different behavior, enabling flexibility and scalability.
Abstraction hides implementation details and shows only the necessary features, simplifying interaction with complex systems.
- Class Variables: Shared among all objects of the class.
- Instance Variables: Unique to each object.
Methods define the behavior of a class and typically operate on instance variables using the self keyword.
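A minimal sketch of these ideas; the class names, attributes, and the bonus logic are illustrative, not from the course notebooks:

```python
class Customer:
    bank = "ABC Bank"                      # class variable, shared by all objects

    def __init__(self, name, balance):
        self.name = name                   # instance variables, unique per object
        self._balance = balance            # leading underscore: encapsulation by convention

    def deposit(self, amount):             # method operating on instance data via self
        self._balance += amount
        return self._balance


class PremiumCustomer(Customer):           # inheritance: child reuses parent behavior
    def deposit(self, amount):             # polymorphism: same method name, different behavior
        return super().deposit(amount * 1.01)   # e.g. a 1% bonus on deposits


c = Customer("Asha", 1000)
p = PremiumCustomer("Ravi", 2000)
print(c.deposit(100), p.deposit(100), Customer.bank)   # 1100 2101.0 ABC Bank
```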
OOPs provides structure and clarity in programming through:
- Organized code via classes and objects
- Reusability using inheritance
- Flexibility with polymorphism
- Data protection via encapsulation
- Simplification through abstraction
NumPy (Numerical Python) is a powerful library for numerical computations in Python. It provides support for multi-dimensional arrays and matrices, along with optimized mathematical functions to operate on them efficiently.
- Speed: NumPy arrays are much faster than Python lists because they are implemented in C.
- Memory Efficiency: Arrays use less memory compared to lists.
- Functionality: Includes advanced mathematical, statistical, and array manipulation tools.
- Scalability: Suitable for large datasets in data science and analytics.
Install NumPy using:
pip install numpy

NumPy is generally imported as:

```python
import numpy as np
```

This convention simplifies function calls and ensures consistency in Python programs.
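A small sketch of array creation and vectorized operations (the values are illustrative):

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5])
matrix = np.array([[1, 2], [3, 4]])        # 2-D array

print(arr * 2)                # [ 2  4  6  8 10]  (element-wise, no explicit loop)
print(arr.mean(), arr.std())  # basic statistics
print(matrix.shape)           # (2, 2)
print(matrix.T)               # transpose
print(np.sqrt(arr))           # vectorized mathematical function
```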
Pandas is a fast, flexible, and powerful Python library for data analysis and manipulation. It handles structured data efficiently using DataFrames and Series.
- Speed: Optimized for large datasets.
- Flexibility: Supports multiple file formats (CSV, Excel, JSON, etc.).
- Ease of Use: Simplifies data handling.
- Data Science: Essential for cleaning, exploring, and analyzing data.
Install Pandas with pip install pandas. Typical usage:

```python
import pandas as pd
import numpy as np

# Load and inspect data
df = pd.read_csv("path/to/Churn_Modelling.csv")
df.head()
df.info()
df.describe()

# Create new columns
df['NewSalary'] = df['EstimatedSalary'] * 1.1
df['FullName'] = df['CustomerId'].astype(str) + ' ' + df['Surname']
df['Bal_SQRT'] = df['Balance'].apply(np.sqrt)

# Filter and sort
filtered_df = df[df['Age'] >= 50]
df_sorted = df.sort_values(by=['Age', 'Tenure'], ascending=[False, True])

# Handle missing values
dfa = pd.read_csv("path/to/Test.csv")
dfa_clean = dfa.dropna()
dfa['Age'] = dfa['Age'].fillna(dfa['Age'].median())

# Drop columns
df.pop('RowNumber')
df = df.drop(columns=['Surname', 'CreditScore'])

# Group and aggregate
geo_mean = df.groupby('Geography').mean(numeric_only=True)
geo_gender_mean = df.groupby(['Geography', 'Gender'])['Balance'].mean()

# Combine DataFrames
result = pd.concat([df1, df2])
merged = pd.merge(df1, df2, on='cust_id', how='inner')
```

Matplotlib works well with Pandas DataFrames, which are often used to prepare data for visualization.
Example: Creating a DataFrame from a Dictionary

```python
import pandas as pd
import matplotlib.pyplot as plt   # needed for plt.show() and the pyplot examples below

Data = {'Year': [1920, 1930, 1940, 1950, 1960, 1970, 1980, 1990, 2000, 2010, 2020],
        'Exchange Rate': [65, 69, 71, 64, 62, 59, 72, 71, 75, 78, 81]}
df = pd.DataFrame(Data)

# Basic plot kinds from a DataFrame
df.plot(x='Year', y='Exchange Rate', kind='line')
plt.show()
df.plot(x='Year', y='Exchange Rate', kind='area')
plt.show()
df.plot(x='Year', y='Exchange Rate', kind='bar')
plt.show()
df.plot(x='Year', y='Exchange Rate', kind='scatter')
plt.show()

# Pie chart from a DataFrame
Data = {'Tasks': [100, 500, 300]}
df2 = pd.DataFrame(Data, columns=['Tasks'], index=['Pending', 'Completed', 'Ongoing'])
df2.plot.pie(y='Tasks', figsize=(5, 5))
plt.show()

# Pie chart with pyplot, exploding one slice
labels = ['Java', 'Python', 'R', 'Javascript']
sizes = [15, 30, 45, 10]
explode_labels = (0, 0.2, 0, 0)
fig1, ax1 = plt.subplots()
ax1.pie(sizes, explode=explode_labels, labels=labels, shadow=True, startangle=90)
ax1.axis('equal')
plt.show()

# Plots from the churn dataset
churn_df = pd.read_csv('Churn_Modelling.csv')
churn_df.drop(['RowNumber', 'CustomerId', 'Surname'], axis=1, inplace=True)

churn_df['Geography'].value_counts().plot(kind='bar')
plt.show()
plt.scatter(churn_df['Age'], churn_df['Tenure'])
plt.show()
plt.hist(churn_df['Tenure'], bins=30)
plt.show()
churn_df['Age'].plot.box()
plt.show()

# Seaborn builds on top of Matplotlib
import seaborn as sns
sns.countplot(x='Geography', data=churn_df)
plt.show()
```

Seaborn is a Python visualization library built on top of Matplotlib.
It simplifies creating statistical and attractive plots and integrates well with Pandas DataFrames.
It also includes built-in datasets like Iris and Flights for practice.
- Simplifies complex plots (heatmaps, pairplots)
- Focused on statistical visualizations
- Comes with aesthetic default styles
- Works seamlessly with Pandas
- Includes sample datasets for learning
pip install seaborn
Count Plot: Shows frequency of categories
KDE Plot: Displays data distribution
Histplot: Combines histogram + KDE
Pair Plot: Shows pairwise relationships
Line Plot: Visualizes trends
Box Plot: Shows spread and outliers
Heatmap: Visualizes correlations or matrices
Apply color palettes (Spectral, coolwarm)
Add titles, labels, and legends using Matplotlib
Use sns.set_theme() for unified styling
Correlation Heatmap: Shows relationships between features
Pairplot with Hue: Highlights categories
Jointplot / Lmplot: Shows two-variable relationships
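A small sketch of a few of these plots using Seaborn's built-in Iris dataset (a minimal example, not from the course notebook):

```python
import seaborn as sns
import matplotlib.pyplot as plt

iris = sns.load_dataset("iris")    # built-in sample dataset (downloaded on first use)

sns.set_theme()                    # unified styling

# Histplot: histogram combined with a KDE curve
sns.histplot(data=iris, x="sepal_length", kde=True)
plt.title("Sepal length distribution")
plt.show()

# Pairplot with hue: pairwise relationships, colored by category
sns.pairplot(iris, hue="species")
plt.show()

# Correlation heatmap of the numeric columns
sns.heatmap(iris.drop(columns="species").corr(), annot=True, cmap="coolwarm")
plt.show()
```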
Plotly (Interactive)
Use Plotly for interactive dashboards and dynamic visualizations.
Seaborn = Easy + Beautiful + Statistical
Ideal for Exploratory Data Analysis (EDA)
Plotly → for interactivity
Matplotlib → for customization