# **Python Basics**
Let's start by learning how to work with Python **variables**, **data types**, and **basic operations**.

## **Variables and Data Types**
- In Python, a **variable** is a container for storing data values.
- Python supports several **data types**, including:
  - **Integers**
  - **Floats**
  - **Strings**
  - **Lists**


In [51]:
name = 'victoria'
print(name)

victoria
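
The data types listed above can be inspected with the built-in `type()` function. A minimal sketch, in which every value is made up for illustration:

```python
# Hypothetical values, one per data type listed above
age = 30                  # integer
price = 19.99             # float
name = 'victoria'         # string
scores = [88, 92, 79]     # list

for value in (age, price, name, scores):
    # type() reports the data type Python assigned to the value
    print(value, type(value).__name__)
```

Python infers the type from the value, so no type declaration is needed when creating a variable.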


## **Basic Operations**
- Python allows you to perform **arithmetic operations** on numbers and **concatenate strings**.

In [1]:
# Arithmetic: add two integers
print(5+1)
print(1+2)

6
3


In [3]:
# Concatenate strings with +
print('Hello'+' '+'Victoria')

Hello Victoria
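
Beyond addition, Python has operators for the other common arithmetic operations, and numbers must be converted to text before they can be joined to a string. A short sketch, with the variable names and values chosen only for illustration:

```python
a, b = 7, 2

print(a - b)    # subtraction
print(a * b)    # multiplication
print(a / b)    # true division, always returns a float
print(a // b)   # floor division, drops the fractional part
print(a % b)    # remainder
print(a ** b)   # exponentiation

# A number must be converted with str() before concatenating with +
count = 3
message = 'You have ' + str(count) + ' new messages'
print(message)

# An f-string performs the same conversion inline
name = 'Victoria'
greeting = f'Hello {name}, you have {count} new messages'
print(greeting)
```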


## **Conditional Statements**
- **Conditional statements** (if-else) let you execute different blocks of code based on a condition.

In [6]:
age = 18
if age >= 18:
  print('You are an adult')
else:
  print('You are not an adult')

You are an adult
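
When there are more than two cases, `elif` adds further conditions between `if` and `else`. The sketch below uses a hypothetical `age_group` function with arbitrary age thresholds:

```python
def age_group(age):
    """Return a label for the given age (thresholds are illustrative)."""
    if age < 13:
        return 'child'
    elif age < 18:
        return 'teenager'
    else:
        return 'adult'

for sample_age in (8, 15, 18):
    print(sample_age, age_group(sample_age))
```

Conditions are checked from top to bottom, and only the first branch whose condition is true runs.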


## **Loops**
- **Loops** are used to repeat a block of code multiple times.

In [7]:
for i in range(5):
  print(i)

0
1
2
3
4
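
Loops are often used to accumulate a result, and a `while` loop repeats for as long as a condition holds. A small sketch with made-up values:

```python
# Sum the numbers 1 through 5 with a for loop
total = 0
for i in range(1, 6):   # range(1, 6) yields 1, 2, 3, 4, 5
    total += i
print(total)

# Count down with a while loop until the condition becomes false
count = 3
while count > 0:
    print(count)
    count -= 1
```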


## **Working with Data**
Python provides powerful tools to work with data. Let's start by exploring **lists** and **dictionaries**.

### **Lists**
- **Lists** are used to store multiple items in a single variable.


In [12]:
fruits = ['apple', 'pear' , 'kiwi']
for fruit in fruits:
  print(fruit)

apple
pear
kiwi
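
Lists also support indexing, slicing, and in-place changes. A brief sketch that extends the fruit list (the added item is arbitrary):

```python
fruits = ['apple', 'pear', 'kiwi']

fruits.append('mango')     # add an item to the end of the list
first = fruits[0]          # indexing starts at 0
last = fruits[-1]          # negative indexes count from the end
first_two = fruits[:2]     # slicing returns a new list

print(first, last, first_two, len(fruits))
```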


### **Dictionaries**
- **Dictionaries** store data in **key-value pairs**, making it easy to organize related information.

In [13]:
fruits = {'apple': 'red',
          'kiwi': 'green',
          'pear': 'yellow'
}
print(fruits['apple'])

red
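
Dictionaries can be updated, looped over, and queried safely for keys that may not exist. A sketch using a hypothetical `fruit_colors` dictionary:

```python
fruit_colors = {'apple': 'red', 'kiwi': 'green', 'pear': 'yellow'}

# Add a new key-value pair
fruit_colors['banana'] = 'yellow'

# items() yields each key together with its value
for fruit, color in fruit_colors.items():
    print(fruit, 'is', color)

# get() returns a default instead of raising KeyError for a missing key
mango_color = fruit_colors.get('mango', 'unknown')
print(mango_color)
```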


# **Data Understanding & Management for Business Intelligence**

### **Importance of Data:**
Data is the foundation of Business Intelligence (BI), analytics, and data science. It's like the raw material needed to build insights and knowledge. Just as a chef needs ingredients to create a dish, BI requires data to generate insights.

- **Decision-Making:** Data enables informed decisions, like analyzing sales data to identify best-selling products and focus marketing efforts.

- **Trend Analysis:** Historical data analysis can reveal trends, helping predict future demands, as seen in seasonal sales patterns.

- **Customer Insights:** Collecting customer feedback and behavior data helps tailor products and services, enhancing customer satisfaction and loyalty.

- **Operational Efficiency:** Data analysis can streamline operations, such as optimizing supply chain logistics based on real-time inventory data.

- **Risk Management:** Analyzing financial and operational data helps identify and mitigate risks, like fraud detection through unusual transaction patterns.


### **Characteristics of Data:**

- **Size:** Data's scale impacts storage and processing. Small datasets might fit in a single spreadsheet, whereas Big Data, like internet traffic logs, requires robust databases and computing power.

- **Structure:** Structured data, such as a customer database, follows a clear format. Unstructured data, like customer reviews, lacks a predefined format and needs more complex processing.

- **Flow:** Data can be continuous, like sensor outputs in a smart factory, or batched, such as monthly sales reports, affecting how and when data is analyzed.


### **Making Data Analytics-Ready:**

- **Relevance and Quality:** Data must be relevant to the specific problem and of high quality.

- **Structure:** It should be organized in a way that's ready for analysis, with key fields and normalized values.

- **Common Definitions:** Terms and variables should have agreed-upon definitions across the organization (Master Data Management). Typical inconsistencies to resolve include:
    - Date formats varying between MM/DD/YYYY and DD/MM/YYYY.
    - Using both abbreviations and full names for states, like "CA" and "California."
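
The two inconsistencies above could be resolved along these lines. This is only a sketch: the helper names and the partial state lookup are hypothetical, and the source date format must be known in advance because a value like 03/04/2024 is ambiguous on its own.

```python
from datetime import datetime

# Hypothetical, deliberately incomplete lookup of state names
STATE_ABBREVIATIONS = {'california': 'CA', 'texas': 'TX'}

def standardize_state(value):
    """Map a full state name to its abbreviation; uppercase anything else."""
    cleaned = value.strip()
    return STATE_ABBREVIATIONS.get(cleaned.lower(), cleaned.upper())

def standardize_date(value, source_format):
    """Parse a date in a known source format and return it as YYYY-MM-DD."""
    return datetime.strptime(value, source_format).date().isoformat()

print(standardize_state('California'))
print(standardize_state('ca'))
print(standardize_date('03/04/2024', '%m/%d/%Y'))  # read as March 4
print(standardize_date('03/04/2024', '%d/%m/%Y'))  # read as April 3
```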


### **Analytics Readiness Metrics:**

- **Reliability:** Ensuring data comes from credible sources and is stored securely, like using authenticated APIs for data collection.

- **Accuracy:** Verifying data matches real-world values, such as ensuring sales figures align with transaction records.

- **Accessibility:** Making data easily available to authorized users through user-friendly platforms or databases.

- **Security and Privacy:** Implementing measures like encryption and access controls to protect data from breaches.

- **Richness:** Gathering comprehensive data sets that cover all aspects needed for analysis, like customer demographics, behaviors, and transaction histories.

- **Consistency:** Standardizing data collection and integration methods to ensure uniformity across all data sources.



### **Understanding Data Taxonomy for Analytics**

**Structured Data:** This includes data organized in a fixed format, making it easily searchable, such as in databases. It is further divided into:

- **Categorical Data:** Data that represents categories or groups. It includes:
  - **Nominal Data:** Data without a natural order, like colors or marital status.
    - *Nominal Data Example:* Consider a Python list holding colors: **`colors = ["Red", "Blue", "Green"]`**. This list represents nominal data as there's no inherent order among these colors.
  - **Ordinal Data:** Data with a natural order but no fixed spacing, like education levels or customer satisfaction ratings.
    - *Ordinal Data Example:* An education level list: **`education_levels = ["High School", "Bachelor's", "Master's", "Ph.D."]`** shows ordinal data. The list has a natural order from least to most advanced education level, but the spacing between each level isn't defined.

- **Numeric Data:** Data represented by numbers. It includes:
  - **Interval Data:** Numerical data without a true zero point, such as temperature.
    - *Interval Data Example:* Temperature readings in Celsius: **`temperatures = [20, 22, 25, 28]`**. These temperatures are interval data as they are numerical with equal spacing but lack a true zero point (where 0 does not mean 'no temperature').
  - **Ratio Data:** Numerical data with a true zero point, allowing for comparisons of magnitudes, like height or weight.
    - *Ratio Data Example:* Weights of objects in kilograms: **`weights = [0, 0.5, 1, 1.5]`**. This list is an example of ratio data because it includes numerical values with a true zero point, indicating the absence of weight.

**Unstructured Data:** Data that does not have a predefined data model, making it harder to collect, process, and analyze. Examples include text documents, images, and videos.

**Semistructured Data:** A blend of structured and unstructured data, which can include elements of both, like XML or JSON files used in web data.
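
As an illustration of semistructured data, a JSON document carries its own field names but allows nesting and optional fields. The record below is invented for this sketch:

```python
import json

# Hypothetical JSON text, such as a web API might return
raw = '{"customer": "Ana", "orders": [{"item": "kiwi", "qty": 3}, {"item": "pear", "qty": 1}]}'

record = json.loads(raw)   # parse the text into Python dicts and lists

# Nested fields are reached by key and by position
total_qty = sum(order['qty'] for order in record['orders'])
print(record['customer'], total_qty)
```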



#### Summary of the **Mathematical Operations** applicable to each level of data taxonomy:

| Data Type | Countable | Comparable | Addable | Subtractable | Multiplicable | Dividable |
|-----------|-----------|------------|---------|--------------|---------------|-----------|
| Nominal   | Yes       | No         | No      | No           | No            | No        |
| Ordinal   | Yes       | Yes        | No      | No           | No            | No        |
| Interval  | Yes       | Yes        | Yes     | Yes          | No            | No        |
| Ratio     | Yes       | Yes        | Yes     | Yes          | Yes           | Yes       |


> For example, interval data cannot be meaningfully multiplied or divided: 20°C is not "twice as warm" as 10°C because the zero point of the Celsius scale is arbitrary and does not represent an absence of temperature.
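
The Celsius point can be checked numerically by converting to Kelvin, a scale whose zero is a true zero. A small sketch (the helper function name is an assumption):

```python
def celsius_to_kelvin(celsius):
    """Convert a Celsius reading to Kelvin, which has a true zero."""
    return celsius + 273.15

# Dividing Celsius readings gives a number, but it has no physical meaning
ratio_celsius = 20 / 10

# On the Kelvin scale the same two temperatures differ by only a few percent
ratio_kelvin = celsius_to_kelvin(20) / celsius_to_kelvin(10)

# Differences are meaningful on an interval scale
difference = 20 - 10

print(ratio_celsius, round(ratio_kelvin, 3), difference)
```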

#### Summary of the **Statistical Operations** applicable to each level of data taxonomy:

| Data Type | Mode | Median | Mean | Standard Deviation | Correlation |
|-----------|------|--------|------|--------------------|-------------|
| Nominal   | Yes  | No     | No   | No                 | No          |
| Ordinal   | Yes  | Yes    | No   | No                 | No          |
| Interval  | Yes  | Yes    | Yes  | Yes                | Yes         |
| Ratio     | Yes  | Yes    | Yes  | Yes                | Yes         |


In [16]:
colors = ["Red", "Blue", "Green"]
print(colors)


['Red', 'Blue', 'Green']


In [17]:
education_levels = ["High School", "Bachelor's", "Master's", "Ph.D."]
print(education_levels)


['High School', "Bachelor's", "Master's", 'Ph.D.']


In [18]:
# Interval data (temperature, no true zero)
temperatures = [20, 22, 25, 28]
print(temperatures)

# Ratio data (weights, true zero exists)
weights = [0, 0.5, 1, 1.5]
print(weights)


[20, 22, 25, 28]
[0, 0.5, 1, 1.5]


# Introduction to Pandas for Business Intelligence

- Pandas is a cornerstone tool for anyone studying Business Intelligence (BI), providing powerful and easy-to-use data structures and data analysis tools for Python. Its name, derived from "panel data" and "Python data analysis," hints at its capabilities and intended use. Here's a brief introduction to get you started:

## Why Pandas?

- In the realm of Business Intelligence, data is king. You're often tasked with collecting, cleaning, analyzing, and reporting data to inform strategic decisions. Pandas excels at handling and transforming data, making it an indispensable tool for BI students. It allows you to read data from various sources, manipulate it efficiently, and prepare it for insightful visualizations.

## Key Features:

### Data Structures:
- Pandas introduces two primary data structures: Series and DataFrame. A Series is a one-dimensional array-like structure, while a DataFrame is a two-dimensional, table-like structure with rows and columns. These structures are designed to handle data in a way that is intuitive and aligned with spreadsheet-like operations, which are familiar to most BI professionals.
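
A minimal sketch of both structures, using a few invented rows whose column names mirror the university dataset used later:

```python
import pandas as pd

# A Series is a single labeled column of values
enrollment = pd.Series([450, 720, 1010], name='Enroll')

# A DataFrame is a table; each column is itself a Series
universities = pd.DataFrame({
    'Private': ['Yes', 'No', 'Yes'],   # hypothetical rows
    'Enroll': [450, 720, 1010],
})

print(enrollment)
print(universities)
print(universities.shape)              # (rows, columns)
print(type(universities['Enroll']))    # selecting one column returns a Series
```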

### Data Manipulation:
- With Pandas, you can easily filter, sort, group, and aggregate your data. Whether you're dealing with time-series data, categorical data, or missing values, Pandas provides a rich set of methods to perform data munging tasks efficiently.
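
Filtering, sorting, and grouping can be sketched on a small invented sales table (the column names and values are assumptions):

```python
import pandas as pd

sales = pd.DataFrame({
    'region': ['West', 'East', 'West', 'East'],   # hypothetical data
    'revenue': [100, 250, 300, 50],
})

# Filter: keep rows where revenue is at least 100
large_sales = sales[sales['revenue'] >= 100]

# Sort: highest revenue first
ranked = sales.sort_values('revenue', ascending=False)

# Group and aggregate: total revenue per region
totals = sales.groupby('region')['revenue'].sum()

print(large_sales)
print(ranked)
print(totals)
```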

### Integration with Data Sources:
- Pandas can seamlessly read and write data from and to various sources like CSV files, Excel spreadsheets, databases, and even cloud storage services. This versatility makes it a great tool for BI tasks, where data might come from disparate sources.
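
Reading and writing CSV data can be sketched without touching the file system by using an in-memory text buffer in place of a file. The CSV text below is invented; with a real file, a path string would be passed instead of the buffer.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file on disk
csv_text = "Private,Enroll\nYes,450\nNo,720\n"

# read_csv accepts a path or any file-like object
small_df = pd.read_csv(io.StringIO(csv_text))
print(small_df)

# to_csv writes the table back out; index=False omits the row labels
buffer = io.StringIO()
small_df.to_csv(buffer, index=False)
print(buffer.getvalue())
```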

### Data Analysis:
- Beyond data manipulation, Pandas also supports more complex data analysis tasks. It integrates well with other libraries like NumPy for numerical operations, Matplotlib and Seaborn for data visualization, and Scikit-learn for machine learning, making it a central part of the Python data science ecosystem.


**Loading the Dataset**

First, the dataset must be in a format that Pandas can read, such as a CSV file. Assuming the data is saved in a file named `university_data.csv`, you can load it into a Pandas DataFrame.


In [35]:
import pandas as pd
from google.colab import files
uploaded = files.upload()  # opens a file picker; returns a dict mapping each filename to its contents




Saving university_data-1.csv to university_data-1 (4).csv


In [44]:
df = pd.read_csv("university_data-1 (4).csv")

**Understanding the Data**

Once the data is loaded, you can perform various operations to understand its structure, content, and characteristics.

**Data Overview:** Use `df.info()` to get a concise summary of the DataFrame, including the number of non-null entries, data types, and memory usage.


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Private      1000 non-null   object 
 1   Outstate     1000 non-null   int64  
 2   PhD          1000 non-null   int64  
 3   Apps         1000 non-null   int64  
 4   Accept       1000 non-null   int64  
 5   Enroll       1000 non-null   float64
 6   Top10perc    1000 non-null   int64  
 7   Grad.Rate    1000 non-null   float64
 8   Region       1000 non-null   object 
 9   Temperature  1000 non-null   int64  
 10  Reputation   1000 non-null   object 
dtypes: float64(2), int64(6), object(3)
memory usage: 86.1+ KB


**Descriptive Statistics:**

Use `df.describe()` to view summary statistics for each numeric column: the count, mean, standard deviation, minimum, quartiles, and maximum. Together these describe the central tendency, dispersion, and shape of each column's distribution.


In [47]:
df.describe()

|       | Outstate    | PhD       | Apps        | Accept      | Enroll     | Top10perc | Grad.Rate  | Temperature |
|-------|-------------|-----------|-------------|-------------|------------|-----------|------------|-------------|
| count | 1000.0      | 1000.0    | 1000.0      | 1000.0      | 1000.0     | 1000.0    | 1000.0     | 1000.0      |
| mean  | 23772.946   | 75.067    | 2687.221    | 2187.186    | 774.628    | 51.859    | 87.294761  | 18.579      |
| std   | 8072.784115 | 12.989351 | 1003.399146 | 1003.356255 | 412.085275 | 17.785484 | 8.391001   | 8.561487    |
| min   | 10007.0     | 50.0      | 1005.0      | 486.0       | 119.0      | 20.0      | 62.591212  | 0.0         |
| 25%   | 17834.5     | 66.0      | 1970.75     | 1460.75     | 464.25     | 40.0      | 81.216268  | 12.0        |
| 50%   | 23113.0     | 75.0      | 2514.5      | 2023.0      | 717.0      | 51.0      | 87.320361  | 20.0        |
| 75%   | 29882.5     | 85.0      | 3367.75     | 2860.25     | 993.5      | 64.0      | 93.284037  | 25.0        |
| max   | 39994.0     | 99.0      | 4999.0      | 4538.0      | 2187.0     | 89.0      | 113.043843 | 34.0        |


**Check for Missing Values:**

It's important to identify if the dataset contains any missing values.


In [48]:
df.isnull().any()


Private        False
Outstate       False
PhD            False
Apps           False
Accept         False
Enroll         False
Top10perc      False
Grad.Rate      False
Region         False
Temperature    False
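
When a column does contain missing values, `isnull().sum()` reports how many there are per column. The sketch below uses a small invented DataFrame with gaps, since the university dataset has none:

```python
import pandas as pd

# Hypothetical data with one missing value in each column
sample = pd.DataFrame({
    'Enroll': [450, None, 1010],
    'Region': ['West', 'East', None],
})

# any() answers "is anything missing?"; sum() counts the missing entries
has_missing = sample.isnull().any()
missing_per_column = sample.isnull().sum()

print(has_missing)
print(missing_per_column)
```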


**Unique Values in Categorical Columns:**

For columns like 'Private', 'Region', and 'Reputation', checking unique values can be insightful.


In [49]:

print(df['Private'].unique())
print(df['Region'].unique())
print(df['Reputation'].unique())


['Yes' 'No']
['West' 'North' 'East' 'South']
['Good' 'Moderate' 'Poor']


Using the 'Private' column in the dataset, perform the following tasks:


- Count how many universities are private versus public.



In [50]:

private_counts = df['Private'].value_counts()
print(private_counts)


Private
Yes    503
No     497
Name: count, dtype: int64
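
Counts can also be expressed as proportions by passing `normalize=True` to `value_counts()`. A sketch on a short invented column rather than the full dataset:

```python
import pandas as pd

# Hypothetical stand-in for the 'Private' column
private = pd.Series(['Yes', 'No', 'Yes', 'Yes'], name='Private')

counts = private.value_counts()                  # absolute counts
shares = private.value_counts(normalize=True)    # fractions that sum to 1

print(counts)
print(shares)
```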


In [55]:
!jupyter nbconvert --to html /content/victoria_lugo.ipynb

This application is used to convert notebook files (*.ipynb)
        to various other formats.


Options
The options below are convenience aliases to configurable class-options,
as listed in the "Equivalent to" description-line of the aliases.
To see all configurable class-options for some <cmd>, use:
    <cmd> --help-all

--debug
    set log level to logging.DEBUG (maximize logging output)
    Equivalent to: [--Application.log_level=10]
--show-config
    Show the application's configuration (human-readable format)
    Equivalent to: [--Application.show_config=True]
--show-config-json
    Show the application's configuration (json format)
    Equivalent to: [--Application.show_config_json=True]
--generate-config
    generate default config file
    Equivalent to: [--JupyterApp.generate_config=True]
-y
    Answer yes to any questions instead of prompting.
    Equivalent to: [--JupyterApp.answer_yes=True]
--execute
    Execute the notebook prior to export.
    Equivalent to: [--ExecutePr

In [None]:
from google.colab import drive
drive.mount('/content/drive')