<a href="https://colab.research.google.com/github/root-git/stratascratch-sql-challenges/blob/main/5_From_Microsoft_to_Google.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## From Microsoft to Google

Consider all LinkedIn users who, at some point, worked at Microsoft. For how many of them was Google their next employer right after Microsoft (no employers in between)?

**Original Question Link:**  
[StrataScratch ID 2078 – From Microsoft to Google](https://platform.stratascratch.com/coding/2078-from-microsoft-to-google?code_type=1)

---

# Table Schema

#### `linkedin_users`

| Column Name | Data Type | Description                    |
|-------------|-----------|--------------------------------|
| user_id     | bigint    | Unique identifier for a user   |
| employer    | text      | Name of the employer           |
| position    | text      | Job title                      |
| start_date  | date      | Employment start date          |
| end_date    | date      | Employment end date            |

---


# Thought Process

1. Identify all job entries for each user, ordered by start time.
2. Find the job(s) where the user worked at Microsoft.
3. Check if their next chronological job after Microsoft was at Google.
4. Exclude users who had other jobs between Microsoft and Google.
5. Count distinct `user_id`s that satisfy the condition.
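
The steps above can also be sketched directly in pandas before writing any SQL. This is a minimal illustration, not the notebook's solution: the `jobs` frame below is a hypothetical two-user subset, and the use of `groupby(...).shift(-1)` to look at the next row is an assumption about how one might express "next employer" in pandas.

```python
import pandas as pd

# Hypothetical miniature of linkedin_users: user 1 moves straight to Google,
# user 3 has Oracle between Microsoft and Google.
jobs = pd.DataFrame({
    "user_id": [1, 1, 3, 3, 3],
    "employer": ["Microsoft", "Google", "Microsoft", "Oracle", "Google"],
    "start_date": pd.to_datetime(
        ["2020-01-01", "2021-01-01", "2018-01-01", "2019-01-01", "2020-01-01"]
    ),
})

# Steps 1-3: order each user's jobs, then look one row ahead within the user.
jobs = jobs.sort_values(["user_id", "start_date"])
jobs["next_employer"] = jobs.groupby("user_id")["employer"].shift(-1)

# Steps 4-5: keep Microsoft rows whose immediate successor is Google,
# then count each user once.
direct = jobs[(jobs["employer"] == "Microsoft") & (jobs["next_employer"] == "Google")]
num_users = direct["user_id"].nunique()
print(num_users)  # 1
```

Only user 1 is counted here, because user 3's job immediately after Microsoft is Oracle.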

In [2]:
import pandas as pd

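# Create mock data with edge cases:
#   users 1, 6, 7: Microsoft -> Google directly (should be counted)
#   user 2: Microsoft -> Amazon
#   user 3: Microsoft -> Oracle -> Google (employer in between)
#   user 4: Microsoft only
#   user 5: Google -> Microsoft (wrong order)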
data = {
    "user_id": [
        1, 1,
        2, 2,
        3, 3, 3,
        4,
        5, 5,
        6, 6,
        7, 7
    ],
    "employer": [
        "Microsoft", "Google",
        "Microsoft", "Amazon",
        "Microsoft", "Oracle", "Google",
        "Microsoft",
        "Google", "Microsoft",
        "Microsoft", "Google",
        "Microsoft", "Google"
    ],
    "position": [
        "Engineer", "SWE",
        "PM", "TPM",
        "Analyst", "Analyst", "Analyst",
        "Eng",
        "SWE", "SWE",
        "PM", "PM",
        "SWE",  "SWE"
    ],
    "start_date": [
        "2020-01-01", "2021-01-01",
        "2019-06-01", "2020-07-01",
        "2018-01-01", "2019-01-01", "2020-01-01",
        "2021-01-01",
        "2018-03-01", "2019-04-01",
        "2017-07-01", "2018-07-01",
        "2020-01-01",  "2021-01-01"
    ],
    "end_date": [
        "2020-12-31", "2022-01-01",
        "2020-06-30", "2021-06-30",
        "2018-12-31", "2019-12-31", "2021-01-01",
        "2021-12-31",
        "2019-03-31", "2020-03-31",
        "2018-06-30", "2019-06-30",
        "2021-01-01",   "2022-01-01"
    ]
}

df = pd.DataFrame(data)
df['start_date'] = pd.to_datetime(df['start_date'])
df['end_date'] = pd.to_datetime(df['end_date'], errors='coerce')


In [4]:
import sqlite3

# Load into SQLite (in-memory)
conn = sqlite3.connect(":memory:")
df.to_sql("linkedin_users", conn, index=False, if_exists='replace')

#Show preview
print(pd.read_sql("SELECT * FROM linkedin_users", conn))

    user_id   employer  position           start_date             end_date
0         1  Microsoft  Engineer  2020-01-01 00:00:00  2020-12-31 00:00:00
1         1     Google       SWE  2021-01-01 00:00:00  2022-01-01 00:00:00
2         2  Microsoft        PM  2019-06-01 00:00:00  2020-06-30 00:00:00
3         2     Amazon       TPM  2020-07-01 00:00:00  2021-06-30 00:00:00
4         3  Microsoft   Analyst  2018-01-01 00:00:00  2018-12-31 00:00:00
5         3     Oracle   Analyst  2019-01-01 00:00:00  2019-12-31 00:00:00
6         3     Google   Analyst  2020-01-01 00:00:00  2021-01-01 00:00:00
7         4  Microsoft       Eng  2021-01-01 00:00:00  2021-12-31 00:00:00
8         5     Google       SWE  2018-03-01 00:00:00  2019-03-31 00:00:00
9         5  Microsoft       SWE  2019-04-01 00:00:00  2020-03-31 00:00:00
10        6  Microsoft        PM  2017-07-01 00:00:00  2018-06-30 00:00:00
11        6     Google        PM  2018-07-01 00:00:00  2019-06-30 00:00:00
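12        7  Microsoft       SWE  2020-01-01 00:00:00  2021-01-01 00:00:00
13        7     Google       SWE  2021-01-01 00:00:00  2022-01-01 00:00:00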

In [5]:
# Replace with your SQL query below
query = """ SELECT * FROM linkedin_users"""

result_df = pd.read_sql(query, conn)

In [12]:
query = """
WITH ranked_jobs AS
(
  SELECT
    user_id,
    employer,
    start_date,
    end_date,
    ROW_NUMBER() OVER(PARTITION BY user_id ORDER BY start_date) AS job_order
  FROM linkedin_users
),
microsoft_jobs AS
(
  SELECT
    user_id,
    job_order AS msft_rank
  FROM ranked_jobs
  WHERE employer = 'Microsoft'
),
next_job AS
(
  SELECT
    m.user_id,
    r.employer AS next_employer
  FROM microsoft_jobs m
  JOIN ranked_jobs r
    ON m.user_id = r.user_id
    AND m.msft_rank + 1 = r.job_order
)
SELECT
  COUNT(DISTINCT user_id) AS num_users
FROM next_job
WHERE next_employer = 'Google'
"""

solution = pd.read_sql(query, conn)

In [13]:
# Compare the two results
are_equal = result_df.equals(solution)

# Print result based on the comparison
if are_equal:
    print("Correct!")
else:
    print("Try again!")

Correct!


### Problem Explanation

### Step 1: Rank each user's jobs chronologically
```sql
SELECT
  user_id,
  employer,
  start_date,
  end_date,
  ROW_NUMBER() OVER(PARTITION BY user_id ORDER BY start_date) AS job_order
FROM linkedin_users
```
- For each `user_id`, assign a row number based on ascending `start_date` (earliest job = 1).
- This lets us identify the sequence of jobs for each user.
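
To see what the ranking produces, here is a small sketch that runs the same `ROW_NUMBER()` expression on a hypothetical three-row table for a single user, inserted out of order on purpose. It assumes the SQLite bundled with your Python supports window functions (SQLite 3.25 or later) and stores dates as ISO text, which sorts chronologically; `conn_demo` and `ranked` are illustrative names.

```python
import sqlite3

conn_demo = sqlite3.connect(":memory:")
conn_demo.execute(
    "CREATE TABLE linkedin_users (user_id INTEGER, employer TEXT, start_date TEXT)"
)
# Rows are inserted out of chronological order to show that the ranking
# depends on start_date, not on insertion order.
conn_demo.executemany(
    "INSERT INTO linkedin_users VALUES (?, ?, ?)",
    [
        (3, "Google", "2020-01-01"),
        (3, "Microsoft", "2018-01-01"),
        (3, "Oracle", "2019-01-01"),
    ],
)

ranked = conn_demo.execute("""
    SELECT employer,
           ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY start_date) AS job_order
    FROM linkedin_users
    ORDER BY job_order
""").fetchall()
print(ranked)  # [('Microsoft', 1), ('Oracle', 2), ('Google', 3)]
```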

### Step 2: Find users who have worked at Microsoft
```sql
SELECT
  user_id,
  job_order AS msft_rank
FROM ranked_jobs
WHERE employer = 'Microsoft'
```
- Extract only the rows where the employer is Microsoft.
- Keep the `job_order` as `msft_rank`, which tells us where the Microsoft job occurred in that user's timeline.

### Step 3: Join to find the next job after Microsoft
```sql
SELECT
  m.user_id,
  r.employer AS next_employer
FROM microsoft_jobs m
JOIN ranked_jobs r
  ON m.user_id = r.user_id
  AND m.msft_rank + 1 = r.job_order
```
- For each Microsoft job (`m`), join to the same user's next job (`r`) by checking that `r.job_order = m.msft_rank + 1`.
- This ensures we are only considering the job directly following a Microsoft position.
- We pull the employer name from the next job and call it `next_employer`.
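
As an alternative to the rank-and-self-join approach, the next employer can be read with the `LEAD()` window function. The sketch below is not the query used in this notebook; it is a swapped-in technique shown for comparison, run on a hypothetical four-user table, and it assumes SQLite 3.25 or later. The names `conn_lead`, `with_next`, and `num_users_lead` are illustrative.

```python
import sqlite3

conn_lead = sqlite3.connect(":memory:")
conn_lead.execute(
    "CREATE TABLE linkedin_users (user_id INTEGER, employer TEXT, start_date TEXT)"
)
conn_lead.executemany(
    "INSERT INTO linkedin_users VALUES (?, ?, ?)",
    [
        (1, "Microsoft", "2020-01-01"), (1, "Google", "2021-01-01"),
        (2, "Microsoft", "2019-06-01"), (2, "Amazon", "2020-07-01"),
        (3, "Microsoft", "2018-01-01"), (3, "Oracle", "2019-01-01"),
        (3, "Google", "2020-01-01"),
        (4, "Microsoft", "2021-01-01"),
    ],
)

lead_query = """
WITH with_next AS (
  SELECT
    user_id,
    employer,
    -- Employer of the same user's next job; NULL if there is none.
    LEAD(employer) OVER (PARTITION BY user_id ORDER BY start_date) AS next_employer
  FROM linkedin_users
)
SELECT COUNT(DISTINCT user_id)
FROM with_next
WHERE employer = 'Microsoft'
  AND next_employer = 'Google'
"""
num_users_lead = conn_lead.execute(lead_query).fetchone()[0]
print(num_users_lead)  # 1
```

Only user 1 qualifies in this sample: user 2 went to Amazon, user 3 went to Oracle first, and user 4 has no job after Microsoft, so `LEAD()` returns `NULL`.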

### Step 4: Count how many of those next jobs were at Google
```sql
SELECT
  COUNT(DISTINCT user_id) AS num_users
FROM next_job
WHERE next_employer = 'Google'
```
- From the list of direct post-Microsoft jobs, we count how many users had Google as their next employer.
- `DISTINCT user_id` ensures that we count unique users, not multiple transitions by the same user.
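
The effect of `DISTINCT` shows up only when a user moves from Microsoft to Google more than once. The sketch below fills a hypothetical `next_job` table by hand (user 9 is an invented case with two such transitions) and compares the two counts; `conn_dup`, `transitions`, and `users` are illustrative names.

```python
import sqlite3

conn_dup = sqlite3.connect(":memory:")
# Hypothetical contents of next_job: user 9 left Microsoft for Google twice.
conn_dup.execute("CREATE TABLE next_job (user_id INTEGER, next_employer TEXT)")
conn_dup.executemany(
    "INSERT INTO next_job VALUES (?, ?)",
    [(1, "Google"), (2, "Amazon"), (9, "Google"), (9, "Google")],
)

# COUNT(user_id) counts transitions; COUNT(DISTINCT user_id) counts people.
transitions, users = conn_dup.execute("""
    SELECT COUNT(user_id), COUNT(DISTINCT user_id)
    FROM next_job
    WHERE next_employer = 'Google'
""").fetchone()
print(transitions, users)  # 3 2
```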

